PATTERN THEORY: AN ENGINEERING
PARADIGM FOR ALGORITHM DESIGN


T.  ROSS 
M.  NOVISKEY 
T.  TAYLOR 
D.  GADD 


Applications  Branch 
Mission  Avionics  Division 


26  July  1991 


Final Report for Period October 1988 - October 1990



Approved  for  public  release;  distribution  unlimited 


AVIONICS  DIRECTORATE 

WRIGHT  LABORATORY 

AIR  FORCE  SYSTEMS  COMMAND 

WRIGHT-PATTERSON AIR FORCE BASE, OHIO 45433-6543


91-17379 


NOTICE


When Government drawings, specifications, or other data are used for any purpose
other than in connection with a definitely Government-related procurement, the
United States Government incurs no responsibility nor any obligation whatsoever.
The fact that the Government may have formulated, or in any way supplied the said
drawings, specifications, or other data, is not to be regarded by implication or
otherwise in any manner construed, as licensing the holder or any other person
or corporation, or as conveying any rights or permission to manufacture, use, or
sell any patented invention that may in any way be related thereto.

This report is releasable to the National Technical Information Service (NTIS).
At NTIS, it will be available to the general public, including foreign nations.


This technical report has been reviewed and is approved for publication.


TIMOTHY D. ROSS
Project Engineer
System Concepts Group


JOHN E. JACOBS, Chief
System Concepts Group
Applications Branch


FOR THE COMMANDER


FLOYD P. JOHNSON, Chief
Applications Branch
Mission Avionics Division


If  your  address  has  changed,  if  you  wish  to  be  removed  from  our  mailing  list  or 
if  the  addressee  is  no  longer  employed  by  your  organization,  please  notify 
WL/AART-2,  Wright-Patterson  AFB,  OH  45433-6542  to  help  us  maintain  a  current 
mailing  list. 

Copies  of  this  report  should  not  be  returned  unless  return  is  required  by 
security  considerations,  contractual  obligations,  or  notice  on  a  specific 
document. 


REPORT  DOCUMENTATION  PAGE 


Form Approved
OMB No. 0704-0188


Public reporting burden for this collection of information is estimated to average 1 hour per response, including the time for reviewing instructions, searching existing data sources,
gathering and maintaining the data needed, and completing and reviewing the collection of information. Send comments regarding this burden estimate or any other aspect of this
collection of information, including suggestions for reducing this burden, to Washington Headquarters Services, Directorate for Information Operations and Reports, 1215 Jefferson
Davis Highway, Suite 1204, Arlington, VA 22202-4302, and to the Office of Management and Budget, Paperwork Reduction Project (0704-0188), Washington, DC 20503.


1. AGENCY USE ONLY (Leave blank)

2. REPORT DATE
26 07 91

3. REPORT TYPE AND DATES COVERED
Final Oct 88

4. TITLE AND SUBTITLE
Pattern Theory: An Engineering Paradigm for Algorithm
Design

5. FUNDING NUMBERS
WU 76290207
PE 62204F
PR 7629
TA 02
WU 07

6. AUTHOR(S)
Timothy D. Ross, Michael J. Noviskey, Timothy N. Taylor,
and David A. Gadd

7. PERFORMING ORGANIZATION NAME(S) AND ADDRESS(ES)
Avionics Directorate, WL, AFSC
WL/AART-2
Wright-Patterson AFB OH 45433-6543
Timothy D. Ross, et al. (513) 255-3215

8. PERFORMING ORGANIZATION REPORT NUMBER

9. SPONSORING/MONITORING AGENCY NAME(S) AND ADDRESS(ES)

10. SPONSORING/MONITORING AGENCY REPORT NUMBER

12a. DISTRIBUTION/AVAILABILITY STATEMENT
Approved for public release; distribution unlimited

12b. DISTRIBUTION CODE


13.  ABSTRACT  (Maximum  200  words) 

This report proposes "Pattern Theory" as a basis for an engineering theory of
algorithm design. Pattern Theory (PT) begins with a general statement of the
problem and then makes deliberate specializations. The problem of finding a pattern
in a function is the essence of algorithm design. The key to PT is its measure of
pattern-ness: Decomposed Function Cardinality (DFC). Low DFC indicates
pattern-ness. The principal result is a demonstration of the generality with which DFC
measures pattern-ness. This generality is supported theoretically by relating DFC
to time complexity, program length and circuit complexity. A test is developed,
based on DFC, for whether or not a function will decompose. This test is used in
Ada Function Decomposition (AFD) programs. AFD produces a decomposition (i.e. an
algorithm in combinational form) and DFC. The generality of DFC is also supported
experimentally. The Pattern Theory approach to machine learning and data compression
demonstrated greater generality than other approaches. The DFCs of over 800
nonrandom functions (numeric, symbolic, string based, graph based, images and files)
were measured. Roughly 98% of the nonrandom functions had low DFC versus less than
1% for random functions. AFD found the classical algorithms for several functions.


14. SUBJECT TERMS
Algorithm, Pattern Recognition, Function Decomposition, Machine
Learning, Computational Complexity, Computers, Representation,
Program Length, Extrapolation, Computing Theory.

15. NUMBER OF PAGES
213

16. PRICE CODE

17. SECURITY CLASSIFICATION OF REPORT
UNCLASSIFIED

18. SECURITY CLASSIFICATION OF THIS PAGE
UNCLASSIFIED

19. SECURITY CLASSIFICATION OF ABSTRACT
UNCLASSIFIED

20. LIMITATION OF ABSTRACT


NSN 7540-01-280-5500

Standard Form 298 (Rev. 2-89)
Prescribed by ANSI Std Z39-18
298-102


Acknowledgements 


There  were  many  people  who  contributed  to  this  project.  Prof.  Alan  Lair  of 
the  Air  Force  Institute  of  Technology  and  Messrs  Devert  Wicker  and  Steve  Thomas 
of  Wright  Laboratory  acted  as  consultants.  During  the  summer  of  1989  we  had 
four temporary employees involved in the project. Mr. Mike Findler (Arizona State
University)  worked  here  under  the  Air  Force  Office  of  Scientific  Research  (AFOSR) 
Graduate  Summer  Research  Program  (GSRP)  [18].  Mr.  Chris  Vogt  (Harvey  Mudd 
College),  Ms.  Tina  Normand  (Miami  University,  Ohio)  and  Mr.  John  Langenderfer 
(Wright  State  University)  were  all  summer  hires.  Mr.  Vogt  implemented  the  Ada  code 
for  the  function  decomposition  algorithms  and  wrote  the  user’s  guide  (Appendix  B). 
Ms. Normand developed several of the combinatorics results of Section 5.2. Mr. Langenderfer
performed the analysis of the relationship between the pattern-ness of a
function  and  the  pattern-ness  of  its  inverse.  The  summer  of  1990  brought  even  more 
help.  There  were  three  participants  in  the  AFOSR  Summer  Faculty  Research  Program 
(SFRP).  Prof.  Mike  Breen  of  Alfred  University  found  and  corrected  several  problems 
in  the  development  of  the  Basic  Decomposition  Condition  (see  [8]  and  Section  5.2) 
and  proved  the  set  intersection  size  result  of  Section  6.6.  Prof.  Thomas  Abraham 
of Saint Paul’s College performed the Perceived Pattern-ness experiment (see [1] and
Section 6.5). Prof. Thomas Gearhart of Capitol University worked mostly on the Convergence
Method [25]; however, he performed most of the Neural Net experiments of
Section  6.6  and  made  important  contributions  through  participation  in  meetings  and 
personal  discussions.  Ms.  Shannon  Spittler  (Miami  University,  Ohio)  contributed  in 
several  areas  as  a  summer  hire,  especially  in  the  generation  and  reduction  of  data. 
Messrs  Mark  Boeke  and  Michael  Chabinyc  were  both  made  available  by  the  AFOSR 
High  School  Apprenticeship  Program.  They,  with  Lt  Taylor,  developed  the  pattern 
phenomenology  database  software  and  produced  many  of  the  initial  results  [5,  11]. 
In  addition  to  those  who  made  direct  technical  contributions,  several  persons  had 
important  roles  in  the  project.  Mr.  Leslie  Lawrence  of  Wright  Lab’s  Plans  Office 
coordinated all of our AFOSR support. Ms. Peggy Saurez provided prompt and professional
secretarial support whenever needed. Mr. John E. Jacobs was the immediate
supervisor  of  all  the  authors  and  his  guidance  was  essential  to  the  project.  Reviews 
by  the  next  higher  level  of  management,  first  Mr.  Arther  A.  Duke  and  then  Mr.  F. 
Paul Johnson, kept the project on track. Support from the next higher level of management,
first Mr. Edward Deal and then Mr. Les McFawn, was essential in that they



Contents 


1  Introduction  1

2  Background  5

2.1  Pattern Theory .  5

2.1.1  Introduction to Pattern Theory .  5

2.1.2  What is a Pattern? .  5

2.2  Background of Related Disciplines .  12

2.2.1  Recognizing Patterns - the Many Disciplines .  12

2.2.2  Pattern Recognition .  12

2.2.3  Artificial Intelligence .  13

2.2.4  Algorithm Design .  15

2.2.5  Computability .  15

2.2.6  Computational Complexity .  16

2.2.7  Data Compression .  16

2.2.8  Cryptography .  17

2.2.9  Switching Theory .  17

2.2.10  Summary .  17

3  The Pattern Theory Paradigm  19

3.1  Why is Pattern Theory Needed? .  19

3.1.1  Offensive Avionics as a Potential Application .  19

3.1.2  Importance of Computing Power in Offensive Avionics .  19

3.1.3  Importance of Algorithms in Computing Power .  20

3.1.4  Role of a “Design Theory” .  22

3.1.5  The Need for a Design Theory for Algorithms .  23

3.1.6  Summary .  24

3.2  The Pattern Theory Approach .  24

3.2.1  The “Given and Find” Characterization of a Design Theory .  24

3.2.2  Definition, Analysis and Specialization .  25

3.3  The General Problem of Computational System Design .  26

3.3.1  Computation and Functions .  26

3.3.2  Representation .  27

3.3.3  Figures-of-Merit .  28

3.3.4  Problem Statement .  29


3.4  The  Pattern  Theory  1  Problem  as  a  Special  Problem  in  Computational 

System  Design .  30 

3.4.1  Special  Rather  Than  General  Purpose  Computers .  31 

3.4.2  Single  Function  Realization  .  31 

3.4.3  Input Representation System .  31

3.4.4  Output  Representation  System .  32 

3.4.5  Functions .  32 

3.4.6  Definition  versus  Realization .  33 

3.4.7  Figure-of-Merit  for  the  PT  1  Problem .  33 

3.4.8  Kinds  of  Patterns .  34 

3.4.9  Problem  Statement . 34 

3.5  Summary  .  34 

4  Decomposed  Function  Cardinality  as  a  Measure  of  Pattern-ness  37 

4.1  Introduction . . .  37 

4.2  Decomposed  Function  Cardinality .  47 

4.3  Decompositions  Encoded  as  Programs  .  48 

4.3.1  Introduction .  48 

4.3.2  Encoding  Procedure .  49 

4.3.3  Length  of  an  Encoding .  50 

4.3.4  %'  Includes  All  Optimal  Representations .  51 

4.3.5  Properties  of  Encodings  .  52 

4.3.6  Decomposed  Function  Cardinality  and  Program  Length  ....  53 

4.4  Decomposed  Function  Cardinality  and  Time  Complexity .  55 

4.5  Decomposed  Function  Cardinality  and  Circuit  Complexity .  58 

4.6  Summary  .  59 

5  Function  Decomposition  61 

5.1  Introduction .  61 

5.2  The  Basic  Decomposition  Condition .  62 

5.2.1  Introduction .  62 

5.2.2  An  Intuitive  Introduction  to  the  Decomposition  Condition  .  .  62 

5.2.3  The  Formal  Basic  Decomposition  Condition .  69 

5.2.4  Non-Trivial  Basic  Decompositions .  73 

5.2.5  Negative  Basic  Decompositions .  76 

5.3  The  Ada  Function  Decomposition  Programs .  79 

5.3.1  Program  Functional  Description .  80 

5.3.2  Program  Software  Description .  82 

5.3.3  Versions  of  the  AFD  Algorithm .  82 

5.4  Ada  Function  Decomposition  Program  Performance .  88 

5.4.1  Cost  Reduction  Performance .  89 

5.4.2  Run-Time  Performance . . . .  93 

5.4.3  Summary .  99 

5.5  Summary  .  99 



6  Pattern  Phenomenology  101 

6.1  Introduction . .  101 

6.2  Randomly  Generated  Functions .  102 

6.2.1  Introduction .  102 

6.2.2  Completely  Random  Functions  .  103 

6.2.3  Functions  with  a  Specific  Number  of  Minority  Elements  ....  108 

6.2.4  Functions  with  a  Specific  Number  of  Don’t  Cares .  110 

6.3  Non-randomly Generated Functions .  113

6.3.1  Numerical  Functions  and  Sequences .  114 

6.3.2  Language  Acceptors  . .  122 

6.3.3  String  Manipulation  Functions .  123 

6.3.4  A  Graph  Theoretic  Function .  127 

6.3.5  Images  as  Functions  .  129 

6.3.6  Data  as  Functions .  130 

6.3.7  Summary .  136 

6.4  Patterns  as  Perceived  by  People  . .  139 

6.4.1  Effect of the Order of Variables on the Pattern-ness of Images .  139

6.5  Pattern-ness  Relationships  for  Related  Functions .  142 

6.5.1  Functions  and  Their  Complements  .  142 

6.5.2  Functions  and  Their  Inverses .  143 

6.6  Extrapolative  Properties  of  Function  Decomposition .  145 

6.6.1  Introduction  .  . . .  145 

6.6.2  FERD  Experiments .  146 

6.6.3  FERD and Neural-Net Comparisons .  154

6.6.4  FERD  Theory .  156 

6.6.5  Summary .  162 

6.7  Summary  .  162 

7  Conclusions  and  Recommendations  163 

8  Summary  165 

A  Program  Length  and  the  Combinatorial  Implications  for  Computing  167 

A.1  Mathematical Preliminaries .  167

A.1.1  Basic Definitions .  167

A.1.2  Combinatorics .  168

A.2  Program Length Constraints for Computation .  171

A.2.1  Introduction .  171

A.2.2  Programmable Machines .  172

A.2.3  Maximum-Minimum Program Length for Finite and Transfinite Sets .  176

A.2.4  Average-Minimum Program Length Bound for Finite Sets .  183

A.3  Summary .  191

B  Function Decomposition Program User’s Guide  193


List  of  Figures 

2.1  The  Algorithm  Design  Process .  6 

2.2  The  Grand  Scheme  for  Algorithm  Design .  7 

2.3  Pattern  Theory  Phase  1 .  7 

2.4  Pattern  Theory  Phase  2 .  8 

2.5  Pattern  Theory  Phase  3 .  8 

2.6  Pattern  Theory  Phase  4 . 9 

2.7  Patterned  and  Un-Patterned  Objects .  11 

2.8  Neural  Net  Paradigm . 14 

2.9  Model-Based  Reasoning  Paradigm .  16 

3.1  Time  Complexity  of  Algorithms .  21 

3.2  Typical  Algorithm  Input  Sizes .  22 

4.1  f as a Composition of Smaller Functions .  39

4.2  Decomposition  of  Addition .  40 

4.3  Decomposition  of  a  Palindrome  Acceptor .  41 

4.4  Decomposition  of  a  Prime  Number  Acceptor .  43 

4.5  Decomposition  of  an  Image .  44 

4.6  An  Image  of  “R” .  44 

4.7  Similar  Decompositions,  One  Recursive,  One  Not . 57 

5.1  Form  of  a  Decomposition .  68 

5.2  Form  of  a  More  General  Decomposition . . .  68 

5.3  Example  Decomposition .  70 

5.4  Relationship  Between  v  and  [Vi],  where  D{f)  =  [/] .  75 

5.5  The  Basic  Decomposition .  75 

5.6  DECOMP_RECORD Data Structure .  80

5.7  FIND_LOWEST_COST Flow Chart .  81

5.8  FIND_LOWEST_COST Pseudo-Code .  82

5.9  Algorithm  Stages .  83 

5.10  Compilation  Dependencies .  84 

5.11  NU_MAX for Each Version of the AFD Algorithm .  86

5.12  Neural  Net  Gross  Architecture . 91 

5.13  Detailed  Architecture  of  a  Neural  Net  Component .  92 

5.14  Specific  NN  Architectures .  92 



5.15  Run-time  versus  DFC  for  Functions  on  Eight  Variables .  97 

5.16  Run-Time versus Number of Minority Elements for Six Variable Functions .  97

5.17  Run-Time versus Number of Minority Elements for Seven Variable Functions .  98

5.18  Run-Time versus Number of Minority Elements for Eight Variable Functions .  98

6.1  Number  of  Functions  versus  DFC  for  n  up  to  24 .  106 

6.2  Number  of  Functions  versus  DFC  for  n  =  5 .  107 

6.3  DFC  With  Respect  to  Number  of  Minority  Elements,  n=4 .  110 

6.4  DFC With Respect to Number of Minority Elements, n=5 .  111

6.5  DFC With Respect to Number of Minority Elements, n=6 .  111

6.6  DFC With Respect to Number of Minority Elements, n=7 .  112

6.7  DFC With Respect to Number of Minority Elements, n=8 .  112

6.8  DFC  as  a  Function  of  the  Number  of  Cares .  113 

6.9  DFC  as  a  Function  of  the  Number  of  Cares,  n  =  7 .  114 

6.10  Font  0  Images  and  DFC .  131 

6.11  Font  1  Images  and  DFC .  131 

6.12  Font  2  Images  and  DFC .  132 

6.13  Font  3  Images  and  DFC . 132 

6.14  Font  4  Images  and  DFC .  132 

6.15  Variable  Permutations  for  Characters  177  and  197  of  Font  0 . 140 

6.16  Variable Permutations for Characters 15 and 1 of Font 0 .  140

6.17  Variable  Permutations  for  Characters  10  of  Font  0  and  48  of  Font  2  .  141 

6.18  Variable  Permutations  for  Characters  51  of  Font  2  and  31  of  Font  3  .  141 

6.19  Relationship  Between  Functions  of  a  Given  DFC  and  the  Average  DFC 

of  Their  Inverses  .  144 

6.20  Learning Curve for XOR Function .  147

6.21  Learning  Curve  for  Parity  Function .  147 

6.22  Learning  Curve  for  Majority  Gate  Function  .  147 

6.23  Learning  Curve  for  a  Random  Function  with  Four  Minority  Elements  148 

6.24  Learning  Curve  for  the  Symmetric  Function .  148 

6.25  Learning  Curve  for  Primality  Test  on  Seven  Variables .  148 

6.26  Learning  Curve  for  Primality  Test  on  Nine  Variables  .  149 

6.27  Learning  Curve  for  a  Random  Function .  149 

6.28  Learning  Curve  for  Font  1  “P” .  149 

6.29  Learning  Curve  for  Font  1  “T” .  150 

6.30  Learning  Curve  for  Font  0  “R” .  150 

6.31  Learning  Examples  for  the  Parity  Function .  151 

6.32  Learning  Examples  for  the  Letter  “R”  152 

6.33  The  Pattern  Theory  Logo  .  152 

6.34  Learning  Curves  for  Random  Functions  on  4  Through  10  Variables  .  .  153 

6.35  Number  of  Samples  Required  for  <  10  errors .  153 



6.36  Neural  Net  Learning  Curve  for  XOR  Function .  154 

6.37  Neural  Net  Learning  Curve  for  Parity  Function  .  154 

6.38  Neural  Net  Learning  Curve  for  Majority  Gate  Function .  155 

6.39  Neural  Net  Learning  Curve  for  the  Symmetric  Function .  155 

6.40  Second  Neural  Net’s  Learning  Curve  for  the  Step  Function .  157 

6.41  Second  Neural  Net’s  Learning  Curve  for  the  Majority  Gate  Function  .  157 

A.1  A Machine’s Interfaces .  172

A.2  “Programs”  in  a  Communications  Context .  173 

A.3  Simplified  RAM  Model .  174 

A.4  BASIC  Allows  for  Tabular  Data  Structures  . .  187 

A.5  An  Example  Table  Machine .  190 


List  of  Tables 

1.1  Find  an  Algorithm  for  This  Function . 2 

1.2  Find  an  Algorithm  for  This  Function . 2 

2.1  Recognizing  a  Pattern  in  a  Function .  9 

2.2  Recognizing  a  Pattern  in  a  Function .  10 

4.1  Function Cardinality of h is 8 .  38

4.2  The Function Cardinality of f and j is 16 .  38

4.3  Functions that Compose f .  39

4.4  Addition  on  Six  Variables  (Four  Output  Functions) .  40 

4.5  Addition Components a1 and c1 (XOR and AND) .  40

4.6  Addition Components a2 and c2 .  41

4.7  Palindrome  Acceptor  on  Six  Variables .  42 

4.8  Palindrome  Acceptor  Component  U]  (NOT  XOR) .  42 

4.9  Palindrome  Acceptor  Component  b  (AND) .  42 

4.10  Prime  Number  Acceptor  Components  Oj  and  . 43 

4.11  Letter  R  Components  c  and  d . 43 

4.12  Letter  R  Component  a .  46 

4.13  Letter R Component b .  45

5.1  A  Table  Representation  of  a  Function .  63 

5.2  A  2-D  Table  of  a  Function  With  Respect  to  a  Partition  of  its  Variables  64 

5.3  A  Second  2-D  Table  of  a  Function  With  Respect  to  a  Partition  of  its 

Variables .  64 

5.4  A  Table  Representation  of  a  Function .  64 

5.5  A  2-D  Table  of  a  Function  With  Respect  to  a  Partition  of  its  Variables  65 

5.6  A  Table  Representation  of  Function  g .  66 

5.7  A Partition Matrix (2-D Table) of g .  66

5.8  A  Partition  Matrix  of  g  With  ^  Defined  .  66 

5.9  g  Defined  by  G  and  (f) .  67 

5.10  Various  Forms  of  ^ .  68 

5.11  Partition  Matrices .  69 

5.12  Functions  /  and  g .  76 

5.13  AFD  Algorithm  Version  Space .  88 

5.14  Average  DFC  for  Set  A  and  Set  B .  90 


5.15  DFC  of  NN  Like  Architectures .  91 

5.16  AFD-DFC  of  NN  Like  Architectures  .  93 

5.17  Average  Run-time  for  Set  A  and  Set  B .  94 

5.18  Run-times  for  Functions  That  Did  Not  Decompose  .  95 

5.19  Run-Times  for  Functions  That  Did  Decompose .  96 

6.1  Number  of  Functions  for  a  Given  DFC .  105 

6.2  Number of Minority Elements Required for a Given Cost .  111

6.3  Addition .  115 

6.4  Subtraction .  116 

6.5  Multiplication . 116 

6.6  Modulus . 117 

6.7  Remainder .  117 

6.8  Square  Root .  118 

6.9  Cube Root .  118

6.10  Sine . 119 

6.11  Logarithm .  119 

6.12  Miscellaneous  Numerical  Functions .  119 

6.13  Primality Tests .  120

6.14  Fibonacci  Numbers .  120 

6.15  DFC  of  Lucas  Functions .  121 

6.16  DFC  of  Binomial  Coefficient  Based  Functions .  121 

6.17  DFC  of  Greatest  Common  Divisor  Function .  122 

6.18  DFC  of  the  Determinant  Function .  122 

6.19  Sample  Languages .  124 

6.20  DFC  of  Language  Acceptors . 125 

6.21  Miscellaneous  String  Manipulation  Functions .  125 

6.22  Sorting  Eight  1-Bit  Numbers .  126 

6.23  DFC  of  Sorting  Four  2-Bit  Numbers .  126 

6.24  Input  Bits  Represent  Arcs .  127 

6.25  Additional  Input  Bits  for  Arcs  to  Self .  128 

6.26  DFC  of  the  Various  k-clique  Functions  on  a  Graph  With  5  Nodes  .  .  128 

6.27  DFC  of  the  Various  k-clique  Functions  on  a  Graph  With  Four  Nodes  129 

6.28  Turbo  Pascal  V5.5  Font  Sets .  129 

6.29  Character  Images  DFC  Statistics  .  130 

6.30  DFC  and  Data  Compression  Results  for  Typical  Files .  134 

6.31  Data  Compression  Summary  for  Typical  Files .  134 

6.32  Data  Compression  Summary  for  Atypical  Files .  135 

6.33  Decomposition  Summary  for  Non-Randomly  Generated  Functions  .  .  137 

6.34  Larger  n  Shows  Greater  Decomposability .  138 

6.35  Character  Images .  140 

6.36  Permutations  of  Variables .  141 

6.37  Number  of  Functions  and  Inverses  with  a  Given  Cost  Combination  .  .  144 

6.38  FERD  (F)  and  NN  (N)  Error  Comparison .  156 


A.1  Fraction of Functions Computable by NN


Chapter  1 
Introduction 


Can you invent an algorithm for the function defined in Table 1.1? That is, can you
write a computer program that generates f(x) when given x and does not use a brute
force table look-up? What about the function in Table 1.2?¹ Think about how you
invented these algorithms. Were your algorithms based on a pattern in the function?
For example, did you notice that for the first example the output is 1 if and only if
the input, taken as a string, is symmetric about its center? What do you think about
computers finding patterns like these? It seems that some people are surprised that
computers cannot already do this. If we know ahead of time that the function has
some specific structure then we can write a program to fine tune that structure; but
we do not have computers that can find basic structures in a very general setting.
Others are surprised that someone would even suggest that computers might be able
to do this. The invention of algorithms has been equated with scientific discovery (e.g.
[32]), which makes one balk at the idea of automating algorithm design. We believe
that algorithm design is at most a subset of scientific discovery and that it is a subset
that can be automated. Further, we believe that the first step towards automation
is to develop a solid theoretical understanding of this pattern finding ability that
characterizes algorithm design. This theoretical understanding must in turn be built
on a solid understanding of “pattern.”
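
The two patterns mentioned in this chapter, the symmetric-string test and the two-bit addition described in the footnote, can each be written as a short program and compared with the brute force table they replace. The following Python sketch is only an illustration under stated assumptions: the tables themselves are not reproduced here, so the four-bit input width and the names is_symmetric, add_two_bit_numbers and truth_table are assumed rather than taken from this report.

```python
from itertools import product


def is_symmetric(bits):
    """Return 1 if the bit string is symmetric about its center, else 0."""
    return int(bits == bits[::-1])


def add_two_bit_numbers(bits):
    """Treat the first 2 bits and the last 2 bits as numbers and add them."""
    return int(bits[:2], 2) + int(bits[2:], 2)


def truth_table(f, n):
    """Brute force table: every n-bit input string mapped to f(input)."""
    inputs = ("".join(p) for p in product("01", repeat=n))
    return {x: f(x) for x in inputs}


# Assumed input width of four bits for both examples.
symmetry_table = truth_table(is_symmetric, 4)
addition_table = truth_table(add_two_bit_numbers, 4)
```

Each table has 16 entries, while each algorithm is a one-line rule; the rule exists only because the function has a pattern to exploit.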

The algorithms in use today were invented by people. There are other similar
engineering products, such as estimation systems, control systems, and communication
systems, that were designed by people, but with a fundamentally different dependence
upon the cleverness of the designers. That is, in the traditional engineering problems,
there is an engineering theory that guides the designer. People must invent new
algorithms without the aid of an engineering theory. The difference between the
algorithm engineering problem and many other engineering problems is reflected in
the difference between “invent” and “design.” Webster [66] defines “invent:”

“... to produce ... through the use of the imagination or of ingenious
thinking ...”

¹One algorithm is to treat the first 2 bits as one number and the second 2 bits as a second number
and then f(x) is the arithmetic sum of these two numbers.


Table  1.2:  Find  an  Algorithm  for  This  Function 


and “design:”

“... to create, fashion, execute, or construct according to plan ...”

It  seems  that  algorithms  are  invented  while  estimation  systems,  control  systems, 
etc.  are  designed.  We  believe  that  the  difference  between  invention  and  design  is 
simply  the  existence  of  an  engineering  theory.  We  need  an  engineering  theory  to 
allow  algorithm  design. 

This  report  introduces  “Pattern  Theory.”  Pattern  Theory  consists  of  a  formal 
definition  of  pattern  (or  structure),  an  approach  to  finding  the  pattern  when  it  exists, 
and  a  characterization  of  various  phenomena  with  respect  to  this  structure.  The 
principal  objective  of  this  report  is  to  demonstrate  that  many  kinds  of  practically 
important  patterns  are  well  reflected  in  this  formal  definition. 

Chapter 2 describes the need for an engineering theory of algorithm design. Chapter 3
describes Pattern Theory, which is our approach to a design theory for algorithms.
The  key  to  our  approach  is  a  measure  of  algorithm  good-ness  that  we  call  Decomposed 
Function  Cardinality  (DFC).  Chapter  4  defines  this  measure  and  relates  it  to  the  more 
conventional  measures.  Function  Decomposition  is  the  method  for  optimizing  with 
respect  to  DFC.  Chapter  5  develops  the  theory  behind  function  decomposition  and 
describes computer programs for accomplishing decompositions. We equate the existence
of a good algorithm for a given function and the existence of a “pattern” in
that  function.  So  the  design  of  a  good  algorithm  is  the  same  as  finding  the  pattern 
in  a  function  and  we  think  of  DFC  as  a  measure  of  the  pattern-ness  of  a  function. 
Chapter  6  reports  on  the  results  of  applying  this  measure  to  a  variety  of  functions; 
we  call  this  class  of  results  “Pattern  Phenomenology.” 
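
To give a rough feel for DFC before Chapter 4 defines it, the following Python sketch costs one hand-made decomposition. It is not part of the AFD programs and rests on assumptions of ours: that the cardinality of a binary function on n variables is the size of its table, 2^n, and that the cost of a decomposition is the sum of the cardinalities of its components. The names cardinality, decomposition_cost and palindrome6, and the particular six-variable palindrome decomposition (three two-input NOT XOR components feeding one three-input AND), are illustrative. DFC itself would presumably be the smallest such cost over all decompositions, which this sketch does not search for.

```python
from itertools import product


def cardinality(num_inputs):
    """Table size of a binary function on num_inputs variables (assumed 2**n)."""
    return 2 ** num_inputs


def decomposition_cost(component_input_counts):
    """Sum of the table sizes of the components of one decomposition."""
    return sum(cardinality(k) for k in component_input_counts)


def palindrome6(bits):
    """Six-variable palindrome acceptor built from small components."""
    # Three two-input equality (NOT XOR) components compare mirrored bits.
    equal = [int(bits[i] == bits[5 - i]) for i in range(3)]
    # One three-input AND component combines the three comparisons.
    return int(all(equal))


# The decomposed form agrees with the direct definition on all 64 inputs.
inputs = ["".join(p) for p in product("01", repeat=6)]
agrees = all(palindrome6(x) == int(x == x[::-1]) for x in inputs)

undecomposed_cost = cardinality(6)                  # one table on six variables
decomposed_cost = decomposition_cost([2, 2, 2, 3])  # three NOT XORs and one AND
```

Under these assumptions the single table costs 64 while the decomposition costs 4 + 4 + 4 + 8 = 20, which is the sense in which a low decomposed cost signals a pattern.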



Chapter  2 
Background 


2.1  Pattern  Theory 

2.1.1  Introduction  to  Pattern  Theory 

The development of Pattern Theory began around 1986 at the Air Force Institute
of Technology (AFIT), Wright-Patterson Air Force Base, Ohio. One of this report’s
authors, then on Long-Term Full-Time training, Prof. Alan V. Lair of AFIT’s
Mathematics Department, and Prof. Matthew Kabrisky of AFIT’s Electrical Engineering
Department all played major roles in this early work. A discussion of many of the
ideas that went into Pattern Theory was published in [48, 50, 51]. The name “Pattern
Theory” was adopted after the International Conference on Pattern Recognition in
1988. Our paper at that conference was in a session entitled “Fuzzy Sets and Pattern
Theory.” All the other papers were clearly about Fuzzy Sets, so we must have been
the Pattern Theory. A team of AART and visiting engineers continued the Pattern
Theory work in the in-house Pattern Based Machine Learning (PBML) project whose
results are the subject of this report. The PBML Project is generally referred to as
Pattern Theory 1 (PT 1) in this report.

2.1.2  What  is  a  Pattern? 

An  Introduction  to  the  Pattern  Theory  Paradigm 

It will be useful to briefly introduce the Pattern Theory (PT) paradigm to motivate
the  background.  Chapter  3  is  a  detailed  introduction  to  the  PT  paradigm.  The  basic 
problem  is  how  do  you  go  from  a  definition  of  a  function  to  a  computer  realization 
of  that  function.  The  problem  has  some  definition  of  a  function  as  its  starting  point 
and  a  computer  algorithm  as  its  solution. 

We  divide  the  kinds  of  information  that  might  constitute  the  definition  into  two 
classes: samples of the function and “other” information about the function. Figure 2.1
represents the algorithm design problem. The grand scheme of Pattern Theory
is  to  eventually  complicate  this  flow  chart  slightly  by  allowing  “learned”  algorithms 


Figure 2.1: The Algorithm Design Process

to  be  added  to  the  “other”  information.  By  closing  the  loop  we  create  an  iterative 
approach  to  realizing  more  and  more  complicated  functions.  Figure  2.2  represents  the 
iterative approach to algorithm design. This representation will be useful in explaining
the phases of Pattern Theory and its relationship to other paradigms. Figures 2.3
through  2.6  represent  the  four  planned  phases  of  Pattern  Theory.  Pattern  Theory 
Phase  1  concerns  algorithm  design  by  function  decomposition  when  the  function  is 
defined  by  an  exhaustive  table.  Pattern  Theory  Phase  2  concerns  algorithm  design 
by  function  decomposition  when  the  function  is  defined  by  a  combination  of  samples 
and  limited  other  information.  Initially  the  other  information  will  simply  be  that 
the  function  has  limited  computational  complexity.  Pattern  Theory  Phase  3  concerns 
algorithm design by function decomposition when the function is defined by a combination
of limited samples and robust other information. Pattern Theory Phase 4
concerns iteratively designing increasingly complex algorithms by function decomposition.
This report is concerned with the results of the first phase. The second phase
(PT  2)  began  as  this  report  was  being  finished. 

While there is no general theory for working the problem of algorithm design, it
has been recognized that finding some pattern in the function could be important
(e.g. “Perhaps the most valuable concept of all in the invention of algorithms is that
of recognizing patterns ...” [38] or “. . . many of the central problems of behavior,
intelligence, and information processing are problems that involve patterns.” [62]).
Pattern Theory is an attempt to formalize this pattern finding problem within the
context of algorithm design. By a “pattern” we mean the structure, order or regularity
in a function. Most people would have no trouble recognizing the patterns in the
functions defined by Table 2.1¹ and Table 2.2². Pattern Theory concerns the problem
of recognizing the patterns in functions that will allow their economical computation.
But, what is a pattern?

Figure 2.2: The Grand Scheme for Algorithm Design

Figure 2.3: Pattern Theory Phase 1

Figure 2.4: Pattern Theory Phase 2

Figure 2.5: Pattern Theory Phase 3

Figure 2.6: Pattern Theory Phase 4

     x     f(x)
     1       1
     2       4
     3       9
     4      16
     5      25
     6      36
     .       .
     .       .

Table 2.1: Recognizing a Pattern in a Function

Intuitive  Ideas  about  Patterns 

We are concerned with patterns in the sense of regularity, order, structure or the
opposite of chaos. People seem to have a common sense notion of pattern-ness.
This common sense notion of a pattern is supported by people’s willingness to assign
a pattern-ness ranking in experiments like Garner’s [24] and those of Section 6.4.
Patterns can occur in many different forms. Figure 2.7 has examples of patterned
and unpatterned images, strings of letters, and sequences of numbers. Again on an
intuitive level, patterns are easier to remember; for example, the sequence

17761812186519151941

is easier to remember (if you recognize the pattern) than a sequence like

73217519816234218192.

Patterns also seem to be easier to extrapolate; for example, we would have more
confidence in guessing the next number in the sequence 2, 4, 6, 8, 10, 12, . . . than in
the sequence 5, 2, 7, 3, 5, 12, . . .

     x     f(x)
     1       1
     2       1
     3       1
     4       0
     5       1
     6       0
     7       1
     8       0
     9       0
    10       0
    11       1
    12       0
     .       .

Table 2.2: Recognizing a Pattern in a Function

¹ f(x) = x².
² Primality test.
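As an illustration of how a recognized pattern permits economical computation, the following minimal sketch regenerates the entries of Tables 2.1 and 2.2 from short rules instead of stored tables. The function names are hypothetical and chosen only for this example. Because Table 2.2 assigns the value 1 to x = 1, the second rule is written as "no divisor strictly between 1 and x," which matches the table as printed; this wording is our assumption about the intended rule.

```python
def square(x: int) -> int:
    """Rule behind Table 2.1: f(x) = x * x."""
    return x * x


def no_proper_divisor(x: int) -> int:
    """Rule matching Table 2.2: 1 if no d with 1 < d < x divides x, else 0."""
    return 0 if any(x % d == 0 for d in range(2, x)) else 1


# Rebuild the tabulated portions of both functions from the rules.
table_2_1 = {x: square(x) for x in range(1, 7)}
table_2_2 = {x: no_proper_divisor(x) for x in range(1, 13)}

if __name__ == "__main__":
    print(table_2_1)
    print(table_2_2)
```

Each rule is a few lines long no matter how far the table is extended, whereas the table itself grows with every new entry.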

Traditional  Ideas  about  Patterns 

Although  there  seems  to  be  this  common  sense  notion  of  pattern-ness,  there  has  been 
little  success  in  capturing  this  notion  as  a  formal  mathematical  concept.  References 
[48,  51]  describe  our  assessment  of  the  traditional  formulations  of  pattern-ness. 

Patterns  and  Simplicity 

We feel that the most useful direction for exploring pattern-ness is the one which
relates pattern-ness and simplicity of description. Simplicity is the opposite of
complexity and computational complexity has a well developed theory. Therefore, through
this connection to complexity, pattern-ness immediately has a rich theory.


Patterned:      abccbaabccba      0,1,1,2,3,5,8,13,21,...
Unpatterned:    accbabcbbcaa      0,2,5,6,9,14,17,20,...

Figure 2.7: Patterned and Un-Patterned Objects

The Relativity Problem

A  problem  arises  though  because  the  theory  is  almost  too  “rich.”  That  is,  there  are 
many  measures  of  complexity  and  pattern-ness  is  then  relative  to  the  measure  used. 
Pattern  Theory  addresses  this  relativity  problem  by  proposing  that  there  is  a  special 
model  of  a  computer  and  a  measure  that  reflects  the  essence  of  complexity  in  the 
sense  of  patterns. 
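The dependence of pattern-ness on the chosen measure can be made concrete with a small experiment. The sketch below uses the compressed length produced by Python's standard zlib module as one possible measure of descriptive complexity. That choice is ours, purely for illustration, and it is exactly the kind of specific, non-general measure that the relativity problem concerns. The helper name compressed_length is hypothetical.

```python
import random
import zlib


def compressed_length(data: bytes) -> int:
    """Length in bytes of the zlib-compressed form of data (one possible measure)."""
    return len(zlib.compress(data, 9))


# A patterned string: the 12-letter string from Figure 2.7 repeated many times.
patterned = b"abccbaabccba" * 500

# A string of the same length over the same alphabet, drawn from a seeded generator.
rng = random.Random(0)
unpatterned = bytes(rng.choice(b"abc") for _ in range(len(patterned)))

if __name__ == "__main__":
    print("patterned:  ", len(patterned), "->", compressed_length(patterned))
    print("unpatterned:", len(unpatterned), "->", compressed_length(unpatterned))
```

Under this measure the repeated string has a far shorter description than the pseudo-random one. A different compressor or a different model of computation could rank other pairs of strings differently.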

In a sense we have gone full circle. We started with the problem of finding
economical representations of a function (i.e. an algorithm). We decided that recognizing
patterns is important in this endeavor. Now we are saying that recognizing patterns
is essentially the same as finding economical representations. Why even bring up the
concept of patterns? The answer lies in the need for a concept of general
computational complexity that does not currently have a name. This needed concept closely
reflects the intuitive notion of a pattern so that is what we call it. We also like the
connection this gives the problem to the early pattern recognition work. This early
work in pattern recognition formed the basis for many current artificial intelligence
problems. When you consider the problem of algorithm design as simply one of
minimizing computational complexity, the temptation to choose a specific non-general
measure of complexity is too strong. We lose sight of the idea of finding the
basic structure (i.e. pattern) in the function. As a practical matter, we could develop
all the “Pattern Theory” concepts in terms of traditional computational complexity.
However, by talking about patterns we feel we more easily focus on the general or
abstract complexity which is so important and it ties us into disciplines which we
think are quite relevant.




2.2  Background  of  Related  Disciplines 

2.2.1  Recognizing  Patterns  —  the  Many  Disciplines 

In the following we will survey the disciplines relevant to Pattern Theory. Perhaps the
most obvious discipline is pattern recognition (e.g. [16, 22, 29]). However, the modern
approaches to pattern recognition do not treat patterns in our special sense. This
position is developed in [48, 51]. Early pattern recognition research was concerned
with special patterns, as are elements of modern research (e.g. [58, 63]). At one
time, Pattern Recognition (PR) and Artificial Intelligence (AI) research had a great
deal in common. This common philosophical base is quite relevant to Pattern Theory.
However, the specific disciplines within PR and AI (e.g. statistical pattern recognition,
syntactic pattern recognition, expert systems, neural nets) seem to have all diverged
from the core problem. In all these disciplines, the basic structure of the problem must
be recognized by the designer without theoretical tools or automation. Only after this
basic structure is defined can theoretical tools or anything approaching automation
be applied. Data Compression (e.g. [27]) can be considered as a problem of finding
and exploiting patterns in data. This has an obvious connection with our problem.
Within the data compression discipline the patterns are recognized by the designer of
the data compression routine, again without theoretical tools or automation. As we
have already mentioned, the complexity and computability disciplines of theoretical
computer science are most related to Pattern Theory. We will make extensive use
of computational complexity results. We will also show that computability is a
sub-problem of complexity and of no special interest within our context (see Appendix A).
Finally, the problem of designing electronic circuits (switching theory) is connected
to Pattern Theory. We will see that with respect to our generalized measure of
complexity, designing efficient circuits and designing efficient algorithms are the same
problem. As you would expect, both problems depend on finding some pattern in
the function to be realized. We make extensive use of function decomposition theory,
which was originally developed within the switching theory context.


2.2.2  Pattern  Recognition 

The  relationship  between  the  traditional  field  of  pattern  recognition  and  Pattern 
Theory  is  discussed  in  depth  in  [48,  49].  The  following  is  a  brief  summary  of  that 
discussion. 

The subject of pattern recognition can be divided up many ways. The most
common is to consider the fields of statistical (also decision-theoretic, geometric or
vector space) pattern recognition, syntactic (also structural or linguistic) pattern
recognition and fuzzy methods of pattern recognition. The references [48, 49] use
a slightly different division, emphasizing the role of a priori structure in designing
recognizers. The a priori structure is the representation system or language used to
express the recognition algorithm. Pattern Theory is an attempt to generalize this
idea of a priori structure. Therefore, the role of a priori structure within traditional
pattern recognition is especially relevant. Most traditional pattern recognition is
based on either a geometric or a syntactic structure. Reference [48] discusses the
background of traditional pattern recognition in terms of these two structures.

The basic disconnect between Pattern Recognition and Pattern Theory lies in our
belief that the interesting pattern finding phenomenon occurs in the design of
recognition systems rather than in their operation. Reference [51] explains this position.
This difference in perspective is reflected in the different approaches to research. In
Pattern Recognition it is generally believed that a researcher should choose a single
realistic problem (typically speech or character recognition). The PT approach is to
study many simple problems (e.g. Chapter 6 reports on over 1000 different functions).
The concern is that when we study only a single function, the researcher ends up
doing the pattern finding and the so-called “pattern recognition” algorithm is simply
a realization of the patterns recognized by the researcher. Studying many different
kinds of functions makes it more difficult for the researcher to insert (deliberately or
unconsciously) any humanly recognized patterns. This forces the machine to do some
true pattern finding.

2.2.3  Artificial  Intelligence 

Machine  learning,  a  problem  of  artificial  intelligence  (AI),  might  be  thought  of  as 
an  attempt  to  automate  the  process  that  we  seek  to  understand.  That  is,  we  want 
to  understand  the  process  of  defining  an  algorithm  while  machine  learning  seeks 
to  automatically  generate  an  algorithm.  Therefore,  Pattern  Theory  has  a  strong 
connection  to  machine  learning. 

We think of the artificial intelligence approach to this problem as one of figuring
out how people do it and then attempting to model that process on a computer. For
example, expert systems derive from the cognitive psychology model of thought and
neural nets derive from the physiological model of the hardware involved in thought.
It is possible that AI will come up with useful systems based on this approach without
any understanding of the process at an abstract level. An often used analogy for AI
is the problem of manned flight. In this analogy the AI approach would be analogous
to the artificial bird approach. That is, we could design machines with bird-like
properties since a bird is an existing system which performs the desired function. We
are trying to take what might be called the “Wright” approach. That is, we seek to
understand the basic phenomenon that will allow us to design from first principles.
This approach will not immediately lead to systems with practical value; however, we
believe it is the only approach to continuing long term improvements.

In AI based machine learning,

“The human engineer specifies a weak method for a problem’s solution
that is (semi) automatically (...) streamlined by the system with
experience.”³

We feel that the so called “weak method” constitutes a large fraction of the overall
solution. The problem addressed in Pattern Theory includes the development of a
weak method as well as the “automatic streamlining.”

Figure 2.8: Neural Net Paradigm

³ From Doug Fisher’s Tutorial: Machine Learning and its Applications, July 1990.

There are many approaches to machine learning. Learning within the context of
expert systems includes rule learning (e.g. [13]), adaptive figures-of-merit to improve
non-exhaustive searches (e.g. [13]), and genetic algorithms (e.g. [15]). Some neural
nets learn [53]. Within the discipline of pattern recognition there are learning methods
for both geometric (e.g. [16]) and syntactic (e.g. [22]) systems. Adaptive systems (e.g.
[37]) as used in estimation and control theory for non-linear systems have as many
learning characteristics as AI systems.

We can characterize machine learning systems using the diagram in Figure 2.1.
The “other” information includes an assumption that the desired function has a
realization of the form used by the learning system. Take Neural Nets for example
(Figure 2.8): the “other” information is an assumption that the desired function
may be represented by the chosen architecture of thresholded linear combinations.
Appendix A demonstrates that this assumption is surprisingly restrictive. The design
approach then is back-propagation or some other method of assigning weights. The
traditional machine learning paradigms are built around a specific structure. The key
idea of Pattern Theory is that we want to find the structure that already exists in
the function. We do not want to try to force fit a function to some structure that we
chose ahead of time.
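To give a feel for how a fixed structure restricts the functions that can be realized, the following minimal sketch enumerates the two-input Boolean functions that a single thresholded linear combination can produce. It is our own illustration, not the argument of Appendix A, and it considers only one unit, not a multi-layer network. The weight and threshold grids and all names are assumptions made for the example.

```python
from itertools import product

# The four input combinations of two binary variables, in a fixed order.
INPUTS = list(product((0, 1), repeat=2))


def threshold_unit(w1, w2, t):
    """Truth table of the unit that outputs 1 when w1*a + w2*b >= t."""
    return tuple(int(w1 * a + w2 * b >= t) for a, b in INPUTS)


def realizable_tables(weights=range(-2, 3),
                      thresholds=(-2.5, -1.5, -0.5, 0.5, 1.5, 2.5)):
    """All truth tables reachable by one unit over a small grid of parameters."""
    return {threshold_unit(w1, w2, t)
            for w1, w2, t in product(weights, weights, thresholds)}


XOR = tuple(a ^ b for a, b in INPUTS)
XNOR = tuple(1 - (a ^ b) for a, b in INPUTS)

if __name__ == "__main__":
    tables = realizable_tables()
    print(len(tables), "of 16 two-input functions are realized by one unit")
    print("XOR realized: ", XOR in tables)
    print("XNOR realized:", XNOR in tables)
```

The two functions that are missed, exclusive-or and its complement, are not linearly separable, so no choice of real weights and threshold would recover them with a single unit.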

One  AI  approach,  known  as  Abduction  [44],  uses  the  “chunking”  idea  of  Miller  [43]. 
The  function  decomposition  approach  of  Pattern  Theory  also  exhibits  this  chunking 
idea. 




2.2.4  Algorithm  Design 

The texts on algorithm design (e.g. [3]) are quite different from the texts on most
other electrical engineering design problems (e.g. circuit design, control system
design, communications system design). Most electrical engineering design texts tell, in
an almost cookbook fashion, how to solve problems of a given type. Typically you
begin by developing a dynamic model of the system involved. Next, you apply some
very general principles, such as modulation in communications or feedback in
controls. Then there are some mathematically rigorous tools for optimizing the design.
Finally there are methods for predicting performance and evaluating the design. By
contrast, texts on algorithm design give a list of specific algorithms that you are to
mix and match to your problem. They do not tell you how to come up with a new
algorithm. If controls texts were like algorithm design texts they might give a table of
feedback gains for specific plants and specific desired step responses, but they would
not give the general relationship between feedback gain and system performance that
control theory actually provides. It seems that if an engineer with a good
understanding of control theory were to compete in solving a new controls problem with an
engineer with no controls background, the engineer with knowledge of control theory
would arrive at a much better design. However, if two engineers were to compete
at discovering a new algorithm, the engineer with a background in algorithm design
would seem to have little advantage (unless, of course, some previously discovered
algorithm happened to fit the new problem). In summary, although you can find texts
on algorithm design, they do not address design of fundamentally new algorithms.

The introduction of an algorithm design text may mention a general principle
of algorithm design known as “divide and conquer” (e.g. [7, p.3]). The function
decomposition approach of Pattern Theory can be thought of as a formalization of
the divide and conquer principle.

An important technology that is being developed and used in the Avionics
Directorate is Model-Based Reasoning, especially its application to target recognition.
From a Pattern Theory perspective, Model-Based Reasoning is not too different from
traditional algorithm design. Referring again to Figure 2.1, model-based simply
means that the “other” information is a collection of models. The algorithm
design problem is classical; that is, we are left to our own inventiveness to turn the
models into an algorithm (Figure 2.9).

Figure 2.9: Model-Based Reasoning Paradigm

2.2.5 Computability

The problem of computability would seem to be quite relevant to Pattern Theory.
But it is not. Computability, in its formal sense, is tied to recursion.

“. . . because all evidence indicates that the class of partial recursive
functions is exactly the class of effectively computable functions; . . .” [35]

It seems clear that recursion is a desirable property in a function, but it is neither
necessary nor sufficient for a function to be patterned. We say this because all functions
of interest in practical computing are finite; all finite functions are partial recursive;
yet, with high probability, a finite function is not practically computable. There are of
course many infinite functions (especially those on the real or natural numbers) that
are of interest, but we never really try to compute them. We would always be
satisfied with the ability to compute these functions on some finite sub-domain. Therefore,
the use of (and complete dependence on) infinite functions for interesting results in
computability makes it of no practical use in Pattern Theory. We will argue later
that recursion is of secondary importance in the general complexity used in Pattern
Theory. Appendix A develops some classical computability results from a Pattern
Theory perspective.
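The remark that most finite functions are not practically computable can be motivated by a standard counting argument. The sketch below is our illustration and is not necessarily the development used in Appendix A. It assumes functions of n binary inputs with a single binary output, descriptions written as binary strings, and an integer k between 1 and 2^n.

```latex
\begin{align*}
\#\{\, f : \{0,1\}^n \to \{0,1\} \,\} &= 2^{2^n}, \\
\#\{\, \text{binary descriptions shorter than } 2^n - k \text{ bits} \,\}
    &= \sum_{i=0}^{2^n - k - 1} 2^i \;=\; 2^{\,2^n - k} - 1, \\
\frac{\#\{\, f \text{ with a description shorter than } 2^n - k \text{ bits} \,\}}{2^{2^n}}
    &< 2^{-k}.
\end{align*}
```

For example, with n = 30 and k = 20, fewer than one function in a million has a description that is even 20 bits shorter than its full table of 2^30 entries.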

2.2.6  Computational  Complexity 

As we have mentioned, “Pattern Theory” might more appropriately be an un-named
sub-set of computational complexity theory. The theory of computational complexity
(e.g. [33, 54, 64]) has well developed measures of complexity. The measures used in
Pattern Theory are a special case of these. There are also many computing theory
results in what we call Pattern Phenomenology. However, complexity theory is oriented
towards analysis rather than the synthesis of computational systems. Sections 4.4 and
4.5 develop the relationship between conventional measures of complexity and Pattern
Theory.

2.2.7  Data  Compression 

The design of a data compression system depends upon recognizing and exploiting
some pattern in the data. However, like algorithm design texts, data compression
texts (e.g. [27]) give you a list of specific procedures for some common patterns that
were recognized by people. They do not tell you how to find new patterns in data.


2.2.8  Cryptography 

Cryptography is concerned with patterns in sequences rather than functions.
Although any mathematician will tell you that a sequence is a function, the problem
is somewhat different. Pattern Theory has so far been concerned with patterns in
functions. Although the problem of breaking codes must involve pattern finding in
the sense of Pattern Theory, we have not explored how Pattern Theory relates to
cryptography.

2.2.9  Switching  Theory 

From a Pattern Theory perspective, the design of electronic circuits is essentially the
same as algorithm design. Unlike algorithm design though, there are many theoretical
synthesis tools. There seem to be three approaches to the design of discrete circuits.
One approach (e.g. [21]), using ROM or PLA’s, uses an essentially brute force table
look-up. This approach offers no special insight into the pattern finding problem. A
second approach is to design optimal two-level circuits [21]. This approach does not
capture patterns in a sufficiently general sense because some highly patterned
functions (e.g. the parity function) do not have efficient two-level realizations. The third
approach is based on function decomposition. The idea of function decomposition
has been around a long time (see [4]), but it has had a limited role in circuit design.
Function decomposition is not even mentioned in many standard Switching Theory
texts (e.g. [21, 26, 45]). When function decomposition is discussed (e.g. [60]), there
seems to be general agreement that function decomposition is “prohibitively
laborious.” We believe that function decomposition gets at the crux of computational
complexity. The practical difficulties of using function decomposition for circuit
design do not detract from its central theoretical role. If nothing else, we hope that
Pattern Theory will contribute to the realization that function decomposition is a (if
not the) fundamental problem in computer science.
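The parity example can be checked by brute force for small input counts. The sketch below is our own illustration with hypothetical function names. It counts the literals in the two-level sum-of-products form of parity and compares that with the number of two-input exclusive-or blocks in a simple chained decomposition. The literal count and the block count are used here only as rough size indicators, not as the report's complexity measure.

```python
from functools import reduce
from itertools import product
from operator import xor


def parity(bits):
    """Odd parity computed as a chain of two-input XOR blocks (a decomposition)."""
    return reduce(xor, bits, 0)


def minterms(n):
    """Input combinations on which n-input parity equals 1."""
    return [bits for bits in product((0, 1), repeat=n) if parity(bits) == 1]


def has_mergeable_pair(n):
    """True if two minterms differ in exactly one input and so could share a product term."""
    ones = set(minterms(n))
    for bits in ones:
        for i in range(n):
            neighbor = bits[:i] + (1 - bits[i],) + bits[i + 1:]
            if neighbor in ones:
                return True
    return False


def two_level_literals(n):
    """Literal count of the sum-of-products form: one n-literal product per minterm."""
    return n * len(minterms(n))


def xor_blocks(n):
    """Number of two-input XOR blocks in the chained decomposition."""
    return n - 1


if __name__ == "__main__":
    for n in (2, 4, 8, 12):
        print(n, two_level_literals(n), xor_blocks(n), has_mergeable_pair(n))
```

Because flipping any single input changes the parity, no two minterms are adjacent and none can be merged, so the two-level form needs 2^(n-1) full product terms while the decomposition needs only n - 1 small blocks.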

2.2.10  Summary 

There  are  many  disciplines  that  are  relevant  to  Pattern  Theory.  As  Pattern  Theory 
matures  there  will  be  many  potential  areas  of  application.  There  are  also  many  results 
from  these  related  fields  that  are  useful  in  Pattern  Theory.  We  especially  use  some 
complexity  ideas  from  computing  theory  and  the  function  decomposition  idea  from 
switching  theory. 




Chapter  3 

The  Pattern  Theory  Paradigm 


3.1  Why  is  Pattern  Theory  Needed? 

This  section  will  attempt  to  motivate  the  Pattern  Theory  work.  This  motivation  is 
developed  by  picking  a  particular  problem,  discussing  the  importance  of  computing 
in  solving  this  problem,  discussing  the  role  of  algorithms  in  doing  computing,  and, 
finally,  discussing  the  need  for  a  theory  to  design  algorithms. 

3.1.1  Offensive  Avionics  as  a  Potential  Application 

The Pattern Theory work was performed in the Mission Avionics Division of the
Avionics Directorate of Wright Laboratory (WL/AART). This organization has
offensive avionics algorithms as a principal product. Therefore, we use this potential
application of an algorithm design theory as an example to motivate the need for such
a theory. The arguments used here could have been couched in terms of any one of
the many diverse problems requiring algorithms (see Section 2.2). We chose offensive
avionics algorithms because we are most familiar with this application and it helps
explain why it is appropriate for Pattern Theory work to be done in this organization.

3.1.2 Importance of Computing Power in Offensive Avionics

Offensive avionics (or fire control) is responsible for locating, identifying and selecting
targets, appropriately releasing weapons and doing this in the most survivable manner
possible. In order to better understand what must be done to meet the responsibilities
of fire control, we often think of fire control as a family of functions. These functions
serve one of two purposes. Either they are part of the overall sensor system or they are
part of the control system. The sensor system attempts to determine the
“state-of-the-world,” which includes targets, self, threats, cooperating friendlies and anything
else that could be a factor. The control system manages all the resources of the
aircraft. This includes deciding on the specific trajectory for the aircraft, managing
the sensors, as well as managing the weapons themselves.

All these functions have always been part of the fire control problem. At one
time, the “weapon system” was just a person. This person formed their
state-of-the-world picture from what they could see and hear. They moved into position on foot
and instinctively planned and executed their “weapon delivery” (perhaps a punch or
kick). Over time, people began to use artificial weapons, at first sticks and stones but
eventually guns and bombs. We began to use artificial sensors such as telescopes and
radars. We also developed artificial means of locomotion, beginning with horses and
eventually leading to airplanes. We have added these increasingly sophisticated
machines to a person, always trying to improve the overall weapon system performance.
Until recently, the extremely adaptive nature of people has allowed them to do their
state-of-the-world assessment, their planning and control functions and to use these
machines effectively. However, there has been an explosion in the complexity of the
weapon systems. Now we not only have an artificial sensor, we have multiple sensors,
each capable of measuring multiple attributes of many targets. Our artificial weapons
now include many types, some with long range and many degrees of flexibility. Our
means of getting about have become faster and more maneuverable.

At  first  we  tried  to  deal  with  the  increasing  complexity  by  putting  more  people 
in  the  system.  The  crew  size  for  bombers  was  six  when  we  built  the  B-52.  Then, 
as  computers  and  software  technology  became  available  we  began  to  deal  with  the 
complexity  more  and  more  through  aids  and  automation.  The  crew  of  the  B-1  was 
down  to  four  and  the  B-2  has  only  two  crew  members. 

What  technology  has  allowed  the  crew  size  to  decrease  despite  an  increase  in 
the  complexity  of  the  task?  What  technology  may  eventually  allow  the  crew  size 
to  go  to  zero?  The  crew  provides  no  useful  work  in  the  force  times  distance  sense. 
Their  sensory  capabilities,  in  terms  of  being  able  to  resolve  and  detect  light,  sound  or 
acceleration  could  be  easily  replaced.  People  are  in  modern  combat  aircraft  for  one 
reason:  their  computing  power.  Therefore,  it  is  fair  to  say  that  computing  power  is 
an  extremely  important  technology  for  avionics  systems. 


3.1.3  Importance  of  Algorithms  in  Computing  Power 

In the preceding section we discussed the importance of computing power. Now we
want to discuss how important algorithms are in overall computing power. We can
think of computing power as being made up of three technologies. One technology
is computing hardware. Fairly good measures of hardware capability exist in terms
of Instructions per Second, Operations per Second, etc. There has been tremendous
growth in computing hardware technology. In addition to hardware, effective
computation requires software. We like to think of this software as being developed in two
stages. First there must be some algorithm that describes the desired computation at
an abstract level. Then this algorithm must be implemented in a specific computer
language. We consider the first problem to be algorithm design and the second
problem to be software engineering. These problems are not entirely separable, just as the
hardware and software problems are not entirely separable; however, it is useful to
break out algorithm design so we can concentrate on it without having to address
specific implementations. There are some pretty good measures of algorithm goodness,
but not for how well software engineering is doing nor for overall computing power.
Although we cannot do so quantitatively, we still want to point out the special role
of algorithm power.

[Plot of runtime (logarithmic axis, 0.1 to 1.0E+10) against number of input bits
(0 to 30) for exponential and cubic complexity.]

Figure 3.1: Time Complexity of Algorithms

The expected computational complexity (time complexity, program length,
number of devices in a circuit, however you want to measure it) of an algorithm for an
arbitrary problem goes up exponentially with respect to the size of the input (see
Section A.2.4). Even poor algorithms are of low order polynomial complexity. The
difference between having even a poor algorithm and no algorithm becomes tremendous
for problems of even modest size input (see Figure 3.1). Figure 3.2 shows typical
input sizes for some problems of interest. Consider the difference between having
and not having an algorithm and then consider the potential for hardware or software
engineering to make up for this difference. Take an estimator as an example: not
having an algorithm could only be compensated for by hardware or software engineering
through an increased capability of times. This demonstrates how ridiculous it
is to even think about computing most functions without a good algorithm.
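The gap between the two curves named in the legend of Figure 3.1 can be tabulated directly. The sketch below is a minimal illustration. It assumes abstract step counts of 2^n for the exponential case and n^3 for the cubic case, with n the number of input bits. The function names are hypothetical and no particular machine speed is implied.

```python
def exponential(n: int) -> int:
    """Steps for a brute-force treatment of n input bits: one per input combination."""
    return 2 ** n


def cubic(n: int) -> int:
    """Steps for a hypothetical cubic-complexity algorithm on n input bits."""
    return n ** 3


def comparison_rows(sizes=range(0, 31, 5)):
    """Rows of (n, 2**n, n**3) for a few input sizes."""
    return [(n, exponential(n), cubic(n)) for n in sizes]


if __name__ == "__main__":
    for n, e, c in comparison_rows():
        print(f"{n:2d} bits: exponential={e:>13,d}  cubic={c:>6,d}")
```

From 10 input bits onward the exponential count exceeds the cubic count, and at 30 bits the two differ by a factor of roughly forty thousand.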

[Figure 3.2: Typical Algorithm Input Sizes. Chart of the number of binary variables (horizontal axis) for each application: algorithm design (estimator, ATR, IFFN fusion, character recognition, missile envelope, control law, library functions), data compression (video, audio, large software or data bases, pictures, airborne/PC software or data) and digital circuit design.]

Some specific examples further demonstrate the payoff in having a good algorithm.
As mentioned earlier, estimation theory has an important role in fire control. In
realizing most estimators it is necessary to invert a matrix. One test for whether
or not the inverse of a matrix exists is to compute the matrix's determinant. To
compute the determinant by standard recursion (as in the usual definition of the
determinant) the run-time goes up as the factorial of the matrix size. However, with
the Gauss-Jordan Elimination algorithm it is possible to compute the determinant
with complexity $n^3$ (see [7, 9]). As an example of what this means, if computing
the determinant of a 20 x 20 matrix takes on the order of 50 milliseconds by the
Gauss-Jordan Elimination method then it would take on the order of 10 million years
by the standard recursion method. Sorting a list with Insertion sort has complexity
$n^2$ (570 minutes to sort 100,000 elements) while Quick sort has complexity
$n \log n$ (30 seconds to sort 100,000 elements). The relatively recent invention of
algorithms like Quick Sort has brought about many of the word processing features
that we use every day. Even minor improvements in an algorithm can have dramatic
effects. For example, the invention of the Fast Fourier Transform (FFT) algorithm
only reduced the complexity from $n^2$ to $n \log n$ (see [46]). However, without the
FFT algorithm, today's real time digital Synthetic Aperture Radar (SAR) capability
could only be achieved with a hardware throughput improvement of about five orders
of magnitude. Note that the FFT and Quick Sort algorithms were “invented.” Without
an engineering theory, things are invented. With an engineering theory, things are
designed.
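
The determinant comparison in this paragraph can be sketched as follows. This is a minimal illustration, not code from the report: the function names are ours, exact rational arithmetic is an assumption made to keep the two results comparable, and the second routine is plain Gaussian elimination (which needs on the order of $n^3$ arithmetic operations) rather than a full Gauss-Jordan reduction.

```python
from fractions import Fraction


def det_cofactor(m):
    """Determinant by cofactor expansion along the first row (factorial growth)."""
    n = len(m)
    if n == 1:
        return m[0][0]
    total = 0
    for j in range(n):
        # Minor: drop row 0 and column j.
        minor = [row[:j] + row[j + 1:] for row in m[1:]]
        total += (-1) ** j * m[0][j] * det_cofactor(minor)
    return total


def det_elimination(m):
    """Determinant by Gaussian elimination with row swaps (about n**3 operations)."""
    a = [[Fraction(x) for x in row] for row in m]
    n = len(a)
    det = Fraction(1)
    for col in range(n):
        # Find a row at or below the diagonal with a non-zero entry in this column.
        pivot = next((r for r in range(col, n) if a[r][col] != 0), None)
        if pivot is None:
            return Fraction(0)  # singular matrix
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det  # a row swap flips the sign
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return det
```

Both routines return the same value; only the amount of work differs, which is the point of the example.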

Therefore,  good  algorithms  are  very  important  in  effective  computing  and  in  a 
real  sense  more  important  than  hardware  or  software  engineering. 

3.1.4 Role of a “Design Theory”

We have gone from recognizing the need for computing power to the need for
algorithms; now we want to recognize the need for an engineering theory to help design
algorithms.  But  before  we  do  that,  we  review  the  role  of  an  engineering  theory  in 
design. 

Although there are some particular well-established engineering design theories
(e.g. Modern Control Theory or Estimation Theory), there does not seem to be much
literature on these kinds of theories in general. There is a body of literature on
methods to improve the creativity of designers (e.g. [6, 17]). There is also some work
on a theory about design (e.g. [28]). However, these do not treat “design theory”
in the desired sense. The most relevant literature about engineering design concerns
“optimal design” (e.g. [47, 57]). In this literature the design process is one of defining
a model, establishing the criteria for a good design and then optimizing the design
with respect to those criteria. While most of the traditional optimal design theories
have quantitative criteria and specific methods for optimization, they would be of
value even without that. A good design theory tells you what is important about a
class of problems, tells you about some absolute limits on performance, allows you
to predict performance, and gives you some specific steps towards solving a class of
problems. For example, estimation theory (e.g. [36]) tells you that it is important to
model the dynamic behavior of the system (i.e. $\dot{x} = Ax + Bu$) and the measurement
process (i.e. $z = Hx + Gw$) as well as the specific form of an optimal estimator based
on these models. Estimation theory also allows you to determine any observability
limitations. Ideally, a design theory would have a formal structure. As Melsa and
Cohn [39] say in regard to decision and estimation theory:

“Although  we  treat  such  problems  intuitively  all  the  time,  it  is  important 
that  we  cast  them  into  a  more  definite  mathematical  model  in  order  to 
develop  a  rigorous  structure  for  stating  them,  solving  them,  and  evaluating 
their  solution.” 

By a mathematical model we would not necessarily mean a numerical model, only
that the model have a formal logical structure.

A  good  design  theory  is  not  the  solution  to  any  particular  problem;  rather,  it  is  a 
tool  useful  in  solving  a  whole  class  of  problems. 

3.1.5  The  Need  for  a  Design  Theory  for  Algorithms 

Historically,  Electrical  Engineering  design  theories  (especially  estimation  and  control 
theory)  have  been  used  to  develop  fire  control  algorithms.  However,  the  modern  fire 
control problem requires a large variety of algorithms. Many of these problems are
either  not  naturally  representable  as  estimation  or  control  problems  or  the  solutions 
provided  by  these  traditional  theories  are  computationally  intractable.  For  example, 
the  determination  of  an  aircraft  trajectory  for  attacking  multiple  ground  targets  in  a 
single  pass  can  be  set  up  as  an  optimal  controls  problem.  However,  because  closed 
form  optimal  solutions  cannot  be  found,  this  leads  to  a  computationally  impractical 
design.  Further,  the  problem  of  selecting  a  trajectory  for  the  attack  of  multiple 
airborne  targets  cannot  even  be  set  up  as  a  reasonable  controls  problem.  The  point 
we  are  trying  to  make  is  that  there  is  a  need  for  a  more  general  theory  of  algorithm 
design.  Our  recognition  of  this  need  arose  in  considering  fire  control  problems  but 
the  need  is  pervasive  in  the  application  of  computing  power. 

With  the  extensive  literature  on  algorithms  it  seems  surprising  that  there  is  not 
a  general  theory  of  algorithm  design.  However,  most  of  this  literature  is  concerned 
with the analysis of algorithms rather than their design. Even the literature on
algorithm design typically does not discuss how to create an algorithm; rather, it tells
you how to apply known algorithms in various situations. When algorithm creation
is discussed, it is in terms of “discover” or “invent” rather than design (e.g. “The
‘discovery’ by Cooley and Tukey in 1965 of a fast algorithm ...” [7] or “The creation
of an algorithm ..., is an inventive process ...” [38]).

Once  connected  with  the  problem  of  “discovery”,  we  begin  to  wonder  if  a  design 
theory  for  algorithms  is  even  possible.  Reference  [31]  argues  that  it  is  not  only 
possible  to  have  a  theory  of  the  discovery  process  but  that  it  is  possible  to  automate 
the  process.  While  we  think  they  are  correct,  this  report  is  concerned  with  simply 
trying  to  understand  algorithm  design  in  a  formal  theoretical  sense.  We  feel  that  a 
thorough  understanding  of  the  problem  is  the  first  step  to  a  useful  design  theory  and 
that  design  assisted  by  an  engineering  theory  would  logically  precede  automation  of 
algorithm  design. 

3.1.6  Summary 

This section attempted to show the practical relevance of Pattern Theory. We began
by discussing the importance of computing in offensive avionics, although the
importance of computing could have been derived from many sources. We then pointed out
the special dependence that computing power has on algorithm design. Improved
hardware or software engineering is fine tuning compared to new algorithms, which
create entirely new capabilities. After clarifying what we mean by a “design theory,”
we explained that a design theory for algorithms would be very beneficial and that such
a theory does not currently exist. The bottom line is that there is a strong, unmet
need for a theory of algorithm design.


3.2  The  Pattern  Theory  Approach 

Pattern Theory is an attempt at an engineering design theory for algorithms. This
section presents the algorithm design problem in a way consistent with an
engineering theory. We begin by expanding on our concept of a design theory.


3.2.1  The  “Given  and  Find”  Characterization  of  a  Design 
Theory 

We  will  develop  our  concept  of  a  design  theory  in  terms  of  “givens”  and  “finds.”  Given 
and  Find  are  intermediate  stages  in  going  from  the  real  problem  to  the  real  solution. 
The  design  theory  provides  methods  for  relating  the  given  problem  statement  to  what 
we  want  to  find.  However,  there  always  remains  the  task  of  couching  the  real  design 
problem  into  a  simplified  problem  of  specific  givens  and  finds  such  that  the  design 
theory  can  be  applied. 

Many engineers first encounter a design theory in Statics. Therefore, we use a
problem from statics as our example, from [40]. The real problem is to design a roof
that will support whatever snow, wind, etc. will stress it. The first step in going
from the “real problem” to the “given” for the design problem is to select some form
of truss. For example, a Howe truss could be selected. This selection might be based
on  the  designer’s  recognition  that  it  is  appropriate  for  this  class  of  problem,  but  is 
outside  the  design  theory.  A  second  step  in  going  from  the  “real  problem”  to  the 
“given”  for  the  design  problem  is  to  make  some  assumptions  about  the  loads  that 
will  be  applied  to  the  truss.  These  assumptions  take  the  form  of  a  certain  magnitude 
force  applied  at  certain  points  on  the  truss.  These  assumptions  might  be  based  on 
the  designer’s  knowledge  of  local  weather,  etc.;  but  again,  this  is  outside  the  design 
theory.  We  have  gone  from  the  “real  problem”  to  a  set  of  “givens.”  This  part  of 
the  design  is  not  based  on  any  “design  theory,”  rather  it  depends  upon  the  human 
element  in  design. 

The  “real  solution”  to  this  problem  might  consist  of  a  complete  specification  of 
materials  in  the  truss,  the  size  and  shape  of  the  members  of  the  truss,  how  the 
members  are  joined,  etc.  The  designer  recognizes  that  if  the  forces  in  the  members 
can  be  found,  then  it  would  be  easier  to  complete  the  real  problem.  For  example, 
a  catalog  could  be  used  to  select  truss  members  once  the  maximum  load  on  a  given 
member  was  known.  Therefore,  we  say  the  “find”  is  the  force  in  each  member.  Again, 
going  from  the  “find”  to  the  “real  solution”  will  not  be  aided  by  the  design  theory. 
However,  now  that  we  have  specific  “givens”  and  “finds,”  we  can  apply  the  “design 
theory”  of  Statics  to  connect  these  two.  In  particular,  given  the  loads  on  a  particular 
truss  we  can  solve  for  the  forces  in  each  member  of  the  truss. 
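
To make the “given” and “find” of the statics example concrete, here is a minimal sketch for a much simpler truss than the Howe truss of [40]: a symmetric triangle of two rafters and a horizontal tie, loaded at the apex. The geometry, the function names and the load values are assumptions made for illustration only. The givens are the apex load and the rafter angle; the finds are the member forces.

```python
import math


def rafter_force(apex_load, angle_deg):
    """Compressive force in each rafter of a symmetric triangular truss.

    Vertical equilibrium at the apex joint gives 2 * F * sin(angle) = apex_load,
    where angle is measured between the rafter and the horizontal.
    """
    angle = math.radians(angle_deg)
    return apex_load / (2.0 * math.sin(angle))


def tie_force(apex_load, angle_deg):
    """Tensile force in the horizontal tie.

    Horizontal equilibrium at a support joint: the tie balances the
    horizontal component of the rafter force.
    """
    angle = math.radians(angle_deg)
    return rafter_force(apex_load, angle_deg) * math.cos(angle)
```

With the member forces in hand, selecting actual members (the step back to the “real solution”) remains outside the theory, as the text notes.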

In  summary,  a  design  theory  operates  within  the  simplified  environment  of  specific 
“givens”  and  “finds.”  The  messy  problems  of  determining  the  “givens”  from  the  real 
problem  and  the  real  solution  from  the  “finds”  are  outside  the  theory.  Pattern  Theory 
is  an  attempt  at  a  design  theory  in  this  sense  for  algorithms. 


3.2.2  Definition,  Analysis  and  Specialization 

It  is  important  that  a  design  theory  begin  with  a  well-defined  problem.  Charles 
Kettering  is  reported  to  have  said: 

“A  problem  well  stated  is  a  problem  half  solved.” 

Our  approach  to  stating  the  problem  is  to  first  define  a  very  general  and  abstract 
problem  (Section  3.3).  A  problem  is  well-defined  when  we  can  say  precisely  what  is 
given,  what  is  to  be  found  and  the  criteria  by  which  the  solutions  are  to  be  judged. 
The  problem  will  then  be  analyzed  to  determine  how  it  might  be  partitioned  into 
simpler  problems.  Finally,  we  specialize  to  one  of  the  simpler  problems  (Section  3.4.9). 
We  deliberately  and  explicitly  set  aside  some  aspects  of  the  problem.  There  are  two 
purposes  to  this  approach.  First,  it  allows  us  to  arrive  at  a  well-defined  and  potentially 
solvable  problem.  Second,  it  allows  us  to  understand  how  our  problem  is  a  special 
case  of  more  general  problems. 


3.3 The General Problem of Computational System Design

Here we develop the most general Pattern Theory problem. This problem is a central
part of many disciplines (cf. Chapter 2). First we must deal with several rather
general,  almost  philosophical,  issues.  We  will  explain  why  we  are  especially  interested 
in  recognizing  patterns  in  functions,  the  meaning  of  a  representation  of  a  function, 
and  figures-of-merit  for  competing  designs. 

3.3.1  Computation  and  Functions 

We  can  imagine  trying  to  recognize  patterns  in  all  kinds  of  mathematical  objects. 
The  examples  of  Section  2.1  were  typically  sequences.  However,  we  believe  that 
functions  have  a  unique  importance  when  considering  pattern  finding  in  connection 
with  computation. 

First of all, functions are a fundamental mathematical concept. A function $f$ is a
set of ordered pairs from $X \times Y$ such that for all $(x_1, y_1)$ and $(x_2, y_2)$ in $f$, if
$x_1 = x_2$ then $y_1 = y_2$. This definition of a function only requires some set theory,
order, and logic as background.

The only trick to being a function is that there be only one output for any given
input. For example, in an Automatic Target Recognition setting our assumption is
that there is exactly one desired output (e.g. target type) for each input (e.g. an
image). This assumption does not preclude the output from having probabilities; in
this case our assumption only requires that there be exactly one desired output
probability distribution (e.g. p(tank) = 0.1, p(truck) = 0.6, p(tree) = 0.2, ...) for each
input. Our assumption does preclude those cases where there are a significant number
of  inputs  for  which  multiple  possible  outputs  would  be  acceptable.  For  example,  if 
we  consider  outputs  of  either  0.99  or  1.0  to  be  acceptable  then  our  assumption  does 
not  hold.  While  this  may  seem  to  be  the  more  common  situation,  it  is  possible  to 
define  a  codomain  for  almost  any  real  problem  such  that  the  assumption  does  hold. 
For  example,  we  could  define  “0.99  or  1.0”  as  a  single  output  value. 
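
The single-output requirement, and the trick of redefining the codomain so that it holds, can be sketched as below. The input names, the helper names and the sample values are hypothetical; only the idea of treating “0.99 or 1.0” as one output value comes from the text.

```python
def is_function(pairs):
    """True if no input appears with two different outputs."""
    seen = {}
    for x, y in pairs:
        if x in seen and seen[x] != y:
            return False
        seen[x] = y
    return True


def coarsen(y):
    """Map both acceptable outputs onto a single codomain element."""
    return "0.99 or 1.0" if y in (0.99, 1.0) else y


# Two acceptable outputs for "img1" violate the single-output assumption.
observations = {("img1", 0.99), ("img1", 1.0), ("img2", 0.0)}

# After coarsening the codomain the same data is a function again.
coarse = {(x, coarsen(y)) for x, y in observations}
```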

Functions  are  also  abstractions  of  most  of  the  traditional  models  of  computation. 
Language  acceptance  is  a  common  model  for  computation  in  the  theory  of  computing. 
Language  acceptance  is  a  special  case  of  a  function;  that  is,  a  language  acceptor  is  a 
function  from  a  set  of  strings  into  the  binary  set  {accept,  reject}.  Problem  solving  is 
a  common  model  of  computation  in  computing  theory  and  some  artificial  intelligence 
contexts.  Problem  solving  is  a  function  from  a  set  of  problem  definitions  into  the  set  of 
possible  solutions.  Decision  making  is  also  a  function  from  the  factors  in  the  decision 
into  the  set  of  possible  decisions.  Functions  are  a  “show  me”  approach  to  modeling 
knowledge.  What  a  computer  (or  person)  knows  is  exactly  the  set  of  questions  that 
it  can  answer.  We  would  say  that  knowledge  is  well  represented  by  a  function  from 
a  set  of  questions  into  a  set  of  answers.  Reference  [49]  discusses  this  relationship 
between mathematical functions and knowledge at length. Many models of machine
learning, e.g. [13, p.326] or [42, p.6], can also be interpreted as special cases of function
realization.
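
As a small illustration of language acceptance viewed as a function, the sketch below maps every binary string to one element of the set {accept, reject}. The particular language (strings with an even number of 1s) and the function name are assumptions chosen only because they are easy to check.

```python
def accepts_even_ones(s):
    """Language acceptor: a function from strings into {"accept", "reject"}."""
    return "accept" if s.count("1") % 2 == 0 else "reject"
```

Each input string has exactly one output, so the acceptor satisfies the definition of a function given earlier.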

Non-function computation problems exist, such as the generation of one-way
communications (radio/TV) or clocks, but virtually all conventional computing is well
modeled by functions.

In summary, a function is an extremely general and well-defined model for
computation. When we talk about computation we are talking about realizing a function.

3.3.2  Representation 

The  notion  of  representation  is  very  important  in  Pattern  Theory  (see  [49,  pp.29-50]). 
The  design  problem  begins  with  some  sort  of  a  representation  of  a  function  and  then 
we  want  to  find  an  efficient  algorithm  that  will  also  be  a  representation  of  that  same 
function.  Therefore,  the  design  problem  is  one  of  translating  representations. 

The representation of a function is meaningful only if there is some agreed-upon
“representation system.” The representation system is somewhat like the syntax and
semantics of a language. It is the background knowledge that one must have to make
sense of a representation.

We do not have a formal definition of a “representation system.” Think of
representation in the sense of communication. Whenever we represent a function, we
must assume that the reader has some knowledge that allows them to make sense
of the representation. This “knowledge” is what we are trying to specify with
“representation system.” An important and unsolved problem of Pattern Theory (and
computing in general) is that of dealing with this idea of a representation system.
Sections 3.4.3 and 3.4.4 explain how we get around this problem for the PT 1 project.

The representation system used for defining the function to be computed is called
the “input representation system.” “Input” refers to this being the input to the
design problem. PT 1 focused on tabular input representation systems. The
representation system used for the solution is called the “output representation system.”
Again, “output” refers to this being the output of the design problem. PT 1 used
directed graphs with functions at each node for the output representation system.

In  addition  to  the  concept  of  a  representation  system,  there  are  many  forms  of 
representation  within  each  system.  Several  classes  of  representation  are  identified  as 
examples  of  this  idea. 

First,  there  is  the  simple  table  definition  of  a  function  (e.g.  Table  4.1).  A  table 
seems  to  require  the  minimum  possible  representation  system. 

Secondly, there is the class of algorithmic representations of a function. These
representations give an algorithm for computing $f(x)$ when given $x$. The
representation $f(x) = x^2 + 2x - 1$ is algorithmic. The representation system for this example
must include knowledge of arithmetic. A common situation in fire control algorithm
design is to have an algorithmic definition of a problem (often called a “truth-model”)
that is too slow for airborne use. The design problem is to find a better algorithmic
representation.
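
The first two classes can be contrasted with a small sketch: the same function given once as a table and once as an algorithm. The choice of three-input parity and all names below are assumptions made for illustration; they are not taken from Table 4.1.

```python
# Table representation: every input is listed with its output.
PARITY_TABLE = {
    (0, 0, 0): 0, (0, 0, 1): 1, (0, 1, 0): 1, (0, 1, 1): 0,
    (1, 0, 0): 1, (1, 0, 1): 0, (1, 1, 0): 0, (1, 1, 1): 1,
}


def parity(bits):
    """Algorithmic representation: exclusive-or of the input bits."""
    result = 0
    for b in bits:
        result ^= b
    return result


def same_function(table, algorithm):
    """Check that an algorithm reproduces every row of a table."""
    return all(algorithm(x) == y for x, y in table.items())
```

Reading the table needs almost no background knowledge, whereas the algorithm assumes the reader understands exclusive-or; that difference is what the representation system is meant to capture.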


A third class of representation might be called the algorithmic inverse class.
Representations from this class provide algorithms that, when given $y$, produce $x$ such
that $y = f(x)$. An example of an algorithmic inverse representation of $f$ is $x = y^2$,
where $y = f(x)$. Therefore, when given $y$ we can generate $x$ using the
representation; however, the representation does not explicitly tell us how to generate $y$ when
given $x$. The “vision” problem is a more practical example of an algorithmic inverse
representation.  For  the  vision  problem,  the  function  that  we  want  to  realize  (i.e.  a 
mapping  from  a  two-dimensional  image  into  a  three-dimensional  model  of  a  scene)  is 
easily  represented  in  inverse  form.  That  is,  we  can  use  geometric  projection,  which 
is  algorithmic,  to  determine  what  two-dimensional  image  would  result  from  a  given 
three-dimensional  scene. 
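
A toy version of the inverse situation can be sketched as follows. Here the inverse representation is assumed to be cubing, and the forward function is recovered by searching over candidate outputs; the names, the search bounds and the reliance on a monotonic inverse are all assumptions of this sketch, and real problems such as vision offer no search this simple.

```python
def inverse_rep(y):
    """Algorithmic inverse representation: returns the x for which y = f(x)."""
    return y ** 3


def forward_by_search(x, lo=0, hi=1000):
    """Realize y = f(x) using only inverse_rep, by bisection over candidate y.

    Assumes inverse_rep is increasing on the integers lo..hi.
    """
    while lo <= hi:
        mid = (lo + hi) // 2
        guess = inverse_rep(mid)
        if guess == x:
            return mid
        if guess < x:
            lo = mid + 1
        else:
            hi = mid - 1
    return None  # x is not produced by any candidate y in the range
```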

Our  fourth  class  of  representation  is  the  algorithmic  NP  class.  The  “NP”  comes 
from  the  non-deterministic  polynomial  set  of  functions  as  studied  in  time  complexity 
which  have  this  form  of  representation.  For  an  algorithmic  NP  representation,  we 
must be given both x and y and then the representation is an algorithm that will
determine  if  y  =  f{x).  An  example  of  this  class  of  representation  is  when  the  function 
has  some  equation  as  its  input  and  solutions  to  the  equation  as  its  outputs.  When 
given  the  equation  and  a  candidate  solution,  it  is  easy  to  tell  if  the  solution  fits. 
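
The equation example can be sketched as a checker: given an equation and a candidate solution, deciding whether the candidate fits is straightforward even when finding a solution is not. Representing the equation as a list of polynomial coefficients, and the function names, are assumptions of this sketch.

```python
def is_solution(coeffs, x):
    """Check whether x satisfies the polynomial equation p(x) = 0.

    coeffs lists the coefficients from the highest power down; the value
    is computed with Horner's rule.
    """
    value = 0
    for c in coeffs:
        value = value * x + c
    return value == 0


def find_solution(coeffs, candidates):
    """Realize the function by brute force: test candidates until one fits."""
    for x in candidates:
        if is_solution(coeffs, x):
            return x
    return None
```

The checker is the representation; `find_solution` shows that turning it into a way of computing the output may require searching the candidates.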

A  fifth  class  of  representation  is  called  the  function  predicate  class.  In  this  class, 
the  function  is  represented  by  some  algorithmic  predicate  on  the  whole  function.  An 
example  of  this  class  is  a  differential  equation. 

A  final  class  might  be  any  mix  of  the  above  classes.  For  example,  a  function 
can  be  represented  by  a  differential  equation  (function  predicate  class)  and  boundary 
conditions  (table  class). 

3.3.3  Figures-of-Merit 

There  is  one  other  idea  that  needs  to  be  developed  before  we  can  state  the  general 
problem.  This  idea  has  to  do  with  what  constitutes  a  “good”  design.  A  well  designed 
computational  system  should  have  a  number  of  properties.  We  divide  these  properties 
into  two  general  categories. 

One  category  has  to  do  with  the  accuracy  of  the  computational  system.  That  is, 
how  often  does  it  produce  errors  or  no  output  at  all.  Errors  could  be  defined  as  the 
difference  between  the  desired  function  and  the  function  actually  computed.  There 
are many options for defining the difference between functions. For example, if $X$
has finite cardinality and $Y$ is the set of real numbers then the difference ($d$) between
functions $f: X \to Y$ and $g: X \to Y$ might be $d = \sum_{x \in X} |f(x) - g(x)|$. For many
computational problems, we want no errors. For other problems, avoiding all errors
is  either  simply  not  possible  or  not  worth  the  increased  cost. 
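
One option of the kind mentioned above, the sum of absolute differences over a finite domain, can be sketched as follows. Storing each function as a dictionary from inputs to outputs, and the function name, are assumptions of this sketch.

```python
def function_difference(f, g):
    """Sum of |f(x) - g(x)| over a shared finite domain.

    f and g are dictionaries mapping each input x to a real-valued output.
    """
    if f.keys() != g.keys():
        raise ValueError("both functions must be defined on the same domain")
    return sum(abs(f[x] - g[x]) for x in f)
```

A difference of zero means the realized function matches the desired one everywhere on the domain.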

The second category of properties concerns monetary costs. There are costs
associated with arriving at the design, physically realizing the design and using the design.
In arriving at the design there are the costs of gathering samples of a function or of
performing experiments to narrow the possible set of functions. We associate these
costs with the definition problem (see [50]). The cost of realizing a design includes the
cost of purchasing and assembling equipment. This is the cost of concern in circuit
design (e.g. [21]). The cost of using the design is most often thought of in terms of the
run-time or memory use on a sequential computer (e.g. [7]), but is also reflected in
circuit design as “depth.” While there is considerable latitude for trading off equipment
cost versus run-time, we think this trade-off is between different ways of exploiting
the singular pattern-ness of a function rather than between different kinds of patterns.
Therefore, we want our measure of pattern-ness to be high whenever it is possible
to realize a function with low equipment cost and reasonable run-time or with small
run-time and reasonable equipment cost.

3.3.4  Problem  Statement 

We  now  state  the  general  computation  system  design  problem  of  PT.  The  statement 
of  the  problem  is  not  sufficiently  precise  to  be  useful  in  the  design  theory  sense.  The 
purpose  in  stating  this  general  problem  (a  problem  that  includes  virtually  everything 
anybody  does  with  computational  systems)  is  that  it  will  allow  us  to  show  how  the 
PT  1  problem  (Section  3.4)  is  a  special  case  of  the  general  problem. 

We state the problem in terms of what is given and what is to be found. For
the  general  problem,  we  are  given  an  input  representation  system,  a  set  of  functions 
represented  in  the  input  representation  system,  a  set  of  output  representation  systems 
and  figures-of-merit. 

The  “input  representation  system”  is  the  language  in  which  the  function(s)  to  be 
computed  is  given.  Sometimes,  if  there  is  a  precise  definition  of  the  function,  the 
input  representation  system  might  be  little  more  than  arithmetic.  For  example,  the 
function  might  be  given  as:  “compute  y  when  given  x  where  y  =  3.”  However, 

when  the  function  is  given  in  vague  terms,  the  representation  system  might  include  a 
natural  language  as  well  as  many  value  judgements.  For  example,  a  function  might  be 
given  as:  “compute  y  when  given  x,  where  x  is  time  and  y  is  the  intensity  and  color 
of  the  video  signal  of  a  new  hit  TV  show.”  Although  it  may  always  be  difficult,  and 
sometimes  impossible,  to  specify  the  input  representation  system,  we  think  that  such 
a  characterization  is  a  necessary  step  towards  a  theoretical  engineering  understanding 
of  the  problem. 

In addition to the input representation system, there must be representations
(expressed in the input representation system) that define the specific functions that
we want to compute. In our general statement of the computational system design
problem we allow for there to be a set of functions to be computed rather than just
a single function. We can imagine that when computing several functions, the
computation of one function might be used in computing a second function. Therefore,
the  design  problem  is  somewhat  different  when  computing  more  than  one  function. 
It  turns  out  that  it  is  not  as  different  as  we  once  thought  (see  Section  6.2.2). 

The design of a “general purpose” computer requires that the most general
problem not only allow for a set of functions, but that there be some superset of functions
and that we do not know which exact subset is to be computed. The idea here is
that there is some set of functions that you might potentially want to compute, but
you do not know exactly which ones. Therefore, the design problem is to come up
with the computer that would do well on average for any subset of functions that
might be specified later. This is the problem faced by people who design general
purpose computers. As with the representation systems, it is not easy to specify the
set of given functions but this specification is necessary for a theoretical engineering
treatment of the problem. Designing algorithms or electronic circuits is a special case
where the functions to be computed are known ahead of time.

The  output  representation  system  is  the  representation  system  that  will  be  used 
to  express  the  design.  For  circuit  design  (including  the  design  of  general  purpose 
computers)  the  output  representation  system  typically  consists  of  some  set  of  circuit 
elements. In algorithm design the output representation system might be a
particular computer language. We said that the “givens” might include a set of output
representation  systems.  Why  a  set?  In  the  most  general  design  problem  we  include 
the  problem  of  selecting  the  output  representation  system.  By  specifying  an  output 
representation  system,  we  are  limiting  the  scope  of  possible  solutions.  Limiting  the 
scope  of  solutions  is  not  desirable  in  itself  but  is  necessary  for  an  engineering  theory 
of  the  problem.  This  scope  limiting  part  of  the  problem  specification  is  characteristic 
of other engineering design theories. For example, classical control theory limits
consideration to control laws based upon linear combinations of the system states and
the  desired  states. 

The figures-of-merit reflect error, the cost of the output representation systems
and the cost of individual representations. The difference between the given function and
the realized function is what we are calling “error.” We sometimes do not insist that
the error be zero. Instead we want it to be close to zero, but not to the point of compromising
the other considerations (especially cost). Therefore, the “givens” must reflect our
relative tolerance for errors and cost. The cost of the output representation system
is essentially the cost of the computer hardware. The cost of the individual
representation is sometimes the monetary cost of the hardware (as in circuit design) and
sometimes the cost of execution (for example the run-time of an algorithm).

We  stated  the  problem  in  terms  of  what  is  given  and  what  is  to  be  found.  For  the 
general  problem,  we  are  given  an  input  representation  system,  the  representations  of 
a  set  of  functions,  a  set  of  output  representation  systems  and  figures-of-merit.  The 
problem  then  is  to  find  an  output  representation  system  (from  the  set  given)  and  the 
representations  of  a  subset  of  the  given  functions  such  that  the  figures-of-merit  are 
optimized. 
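
The shape of this “given and find” statement can be sketched in a few lines. The report does not say how the figures-of-merit are combined, so the weighted sum below is purely an assumption, as are the class name, the field names and the sample candidates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """One candidate design: a representation in some output representation system."""
    system: str    # label of the output representation system (hypothetical)
    error: float   # difference between the given and the realized function
    cost: float    # cost of the system plus the individual representation


def best_design(candidates, error_weight, cost_weight):
    """Find the candidate that minimizes an assumed weighted figure-of-merit."""
    return min(
        candidates,
        key=lambda c: error_weight * c.error + cost_weight * c.cost,
    )
```

Changing the weights changes which candidate wins, which is why the relative tolerance for error and cost has to be part of the givens.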


3.4  The  Pattern  Theory  1  Problem  as  a  Special 
Problem  in  Computational  System  Design 

Recall  that  the  objective  is  to  isolate  that  part  of  traditional  algorithm  design  that 
depends  on  this  special  character  of  patterns  that  we  have  discussed.  This  problem 
will  be  analyzed  and  the  results  extended  back  towards  more  practical  problems. 


There  are  two  basic  mechanisms  for  doing  this  isolation.  First  we  can  partition 
the  general  problem  into  sub-problems,  allowing  us  to  set  aside  some  very  difficult 
practical  problems  that  are  not  directly  involved  in  the  pattern  issues.  The  retained 
portion  of  the  partition  will  have  the  pattern  issues  more  accessible.  The  second 
mechanism  is  to  simplify  the  general  problem,  always  retaining  a  non-trivial  pattern 
finding  problem. 

3.4.1  Special  Rather  Than  General  Purpose  Computers 

The Pattern Theory (PT) 1 problem is concerned only with realizing a set of
functions that are known ahead of time. As discussed above, when designing general
purpose computers, we do not know what exact functions we will eventually be
computing. The design of “special purpose” computers includes circuit design as well as
specific uses of general purpose computers. Therefore, our sense of “special purpose”
computer design includes algorithm design.

3.4.2  Single  Function  Realization 

For  the  general  problem  we  allowed  for  there  to  be  several  functions  to  be  realized. 
The  PT  1  is  concerned  with  realizing  only  a  single  function.  Realizing  a  single  function 
seemed  to  be  a  simpler  problem  that  still  requires  pattern  finding  in  a  non-trivial 
sense.  It  turns  out  that  it  is  not  possible  to  get  completely  away  from  realizing 
multiple  functions  because  when  you  decompose  a  single  function  you  are  creating 
multiple “sub-functions” that must be realized. The Lupanov representation (see [54,
pp. 116-118]) re-uses computations of sub-functions in realizing individual functions.
However,  for  the  relatively  small  number  of  variables  considered  in  the  PT  1  study, 
this  re-use  technique  is  not  effective.  Therefore,  it  is  meaningful  to  consider  single 
function  realization  as  a  further  specialization  to  the  general  computing  problem. 

3.4.3  Input  Representation  System 

For the general computing problem, we did not specify a particular input representation
system. In fact, we did not even give a formal definition of a representation
system.  For  the  PT  1  problem  we  chose  to  limit  consideration  to  input  functions 
represented  as  tables.  The  representation  system  for  tables  is  trivial;  that  is,  the 
knowledge  required  to  use  a  table  is  trivial.  This  specialization  allows  PT  1  to  avoid 
having  to  deal  with  the  messy  problem  of  input  representation  systems.  However,  the 
pattern  finding  problem  when  given  a  function  in  the  form  of  a  table  is  not  trivial. 
In  fact,  this  is  kind  of  a  worst  case  for  pattern  finding.  When  a  function  is  specified 
in  some  non-trivial  representation  system  there  may  be  some  clues  as  to  the  patterns 
in  the  function.  However,  when  the  function  is  given  as  a  table,  there  are  no  clues. 
As  desired,  this  specialization  results  in  a  cleaner  theoretical  problem  while  retaining 
the  essential  pattern  finding  problem. 




3.4.4  Output  Representation  System 

The  output  representation  system  for  PT  1  will  be  kept  at  a  fairly  abstract  level. 
Therefore,  selecting  an  output  representation  system  does  not  require  that  we  select 
a  specific  programming  language  or  a  specific  set  of  circuit  elements.  The  output 
representation  system  for  PT  1  is  a  directed  graph  with  a  function  associated  with 
each  node.  The  details  are  defined  in  Chapter  4.  On  the  surface  it  may  appear  that 
PT  1  specializes  to  combinational  machines  with  a  loss  of  applicability  to  sequential 
machines  and,  in  fact,  we  do  represent  our  decompositions  combinationally.  However, 
because of the connection between time (sequential) complexity and size (combinational)
complexity, the patterns found in the decomposition process are not essentially
different  than  sequential  patterns  (see  Section  4.4).  Therefore,  although  PT  1  does 
specialize  to  combinational  output  representation  systems,  it  does  so  without  loss  of 
generality. 


3.4.5  Functions 

PT  1  makes  a  number  of  specializations  to  the  kind  of  functions  considered.  First 
of  all,  PT  1  is  only  concerned  with  finite  functions.  Although  it  has  been  useful 
to  use  infinite  functions  (to  the  exclusion  of  finite  functions,  i.e.  all  finite  functions 
are computable and have complexity O(1)) in most traditional computing theory
paradigms,  there  is  no  greater  generality  in  infinite  functions.  It  is  simply  a  matter  of 
convenience.  We  feel  that  any  real  problem  can  be  modeled  finitely,  whether  or  not 
the  solution  is  eventually  implemented  in  an  analog  or  discrete  system.  Therefore, 
although  unusual,  our  specialization  to  finite  functions  is  without  loss  of  generality. 

We  are  especially  interested  in  mappings  on  domains  whose  elements  have  parts. 
That  is,  the  inputs  are  made  up  of  multiple  parts.  There  are  two  common  models 
for  these  multiple  part  inputs,  the  string  and  the  vector.  Vectors  are  thought  of  as 
elements of a product of sets (as in $X_1 \times X_2 \times X_3 \times \cdots \times X_n$). All the vectors in
a set typically have the same length, that is dimension, not metric length. Infinite-dimensional
vectors are commonly used in Real Analysis. Strings are thought of
as a sequence of drawings from a single set. Strings are typically not all the same
length. Strings may also have infinite length. Either is sufficiently general to model
the other. All strings of length $n$ or less from an alphabet $S$ can be modeled as
vector elements of $(S \cup \{\mathit{blank}\})^n$, where vectors with a blank left of a non-blank
are not included. Similarly, vectors from $X_1 \times X_2 \times X_3 \times \cdots \times X_n$ can be modeled
as those strings of length $n$ from the alphabet $S = \bigcup_{i=1}^{n} X_i$ with the $i$th component
from $X_i$. Vectors are especially common models in Electrical Engineering applications
such as Circuit Design, Estimation Theory, Control Theory, and Digital Signal
Processing.  String  based  models  are  especially  common  in  Computer  Science,  e.g. 
compilation  problems,  and  are  used  in  Computability  Theory.  Vectors  have  slightly 
more  transparent  combinatorics  (see  Appendix  A).  There  are  approximately  twice 
as  many  strings  as  vectors  for  a  given  maximum  length.  While  this  difference  might 
be  important  in  some  particular  instance,  the  trends  are  the  same  for  functions  on  a 




set of vectors as for functions on a set of strings. The ideas in this report could be
developed exclusively in terms of strings or vectors, without any significant difference
in the fundamental results. We use vectors in the quantitative discussions because of
their simplified combinatorics; however, we also use the string nomenclature in order
to highlight relationships to the traditional theory of computing.
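
The "approximately twice as many strings as vectors" remark can be checked on a small case. The following is a minimal sketch that assumes a binary alphabet and counts the empty string; the function names and the `_` blank symbol are illustrative choices, not notation from this report. It also builds the padded-vector model of strings described above, in which blanks may only appear to the right of non-blanks.

```python
from itertools import product

def count_vectors(alphabet_size: int, n: int) -> int:
    """Number of length-n vectors over an alphabet of the given size."""
    return alphabet_size ** n

def count_strings(alphabet_size: int, n: int) -> int:
    """Number of strings of length 0..n over the alphabet (empty string included)."""
    return sum(alphabet_size ** k for k in range(n + 1))

def strings_as_padded_vectors(alphabet, n, blank="_"):
    """Model every string of length <= n as a length-n vector padded on the right with blanks."""
    vectors = set()
    for k in range(n + 1):
        for s in product(alphabet, repeat=k):
            vectors.add(s + (blank,) * (n - k))
    return vectors

if __name__ == "__main__":
    n = 5
    print(count_strings(2, n), count_vectors(2, n))   # strings vs. vectors, binary alphabet
    print(len(strings_as_padded_vectors("01", n)))    # one padded vector per string
```

For a binary alphabet the ratio approaches 2 as n grows; for larger alphabets the ratio is smaller, so the factor of two should be read as specific to the binary case.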

In addition to limiting consideration to finite functions, PT 1 limits consideration
to binary functions. Binary functions are functions of the form $f : \{0,1\}^n \to \{0,1\}$.
Functions on any other finite domain can be modeled as binary functions. Also, functions
with other codomains can be modeled with multiple binary functions. Therefore,
there is little loss of generality in limiting consideration to binary functions.
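
As an illustration of this modeling step, the sketch below turns a function on an arbitrary finite domain and codomain into several binary functions, one per output bit. The encoding by position in a list and the name `to_binary_functions` are assumptions made for the example; the report does not prescribe a particular encoding.

```python
def to_binary_functions(table: dict, domain: list, codomain: list):
    """Model a finite function as m binary functions on n input bits.

    Elements are coded by their position in `domain` / `codomain` (an assumed,
    arbitrary choice). Returns (n, outputs) where outputs[k] maps an n-bit
    input string to the k-th bit of the coded image.
    """
    n = max(1, (len(domain) - 1).bit_length())     # input bits needed
    m = max(1, (len(codomain) - 1).bit_length())   # output bits needed
    dom_code = {d: format(i, f"0{n}b") for i, d in enumerate(domain)}
    cod_code = {c: format(i, f"0{m}b") for i, c in enumerate(codomain)}
    outputs = [dict() for _ in range(m)]
    for d, c in table.items():
        for k in range(m):
            outputs[k][dom_code[d]] = int(cod_code[c][k])
    return n, outputs

if __name__ == "__main__":
    # Successor modulo 3 on the domain {0, 1, 2}.
    succ = {0: 1, 1: 2, 2: 0}
    n, outs = to_binary_functions(succ, [0, 1, 2], [0, 1, 2])
    print(n, outs)
```

When the domain size is not a power of two, some input codes are unused, so the resulting binary functions are partial.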

3.4.6  Definition  versus  Realization 

In  reference  [51]  we  made  a  distinction  between  the  problem  of  choosing  what  function 
to  compute  (the  definition  problem)  and  the  problem  of  figuring  out  how  to  compute 
the chosen function (the realization problem). Even in our general statement of the
computing problem we had already limited consideration to the realization problem.
The Pattern Theory idea is to associate patterns and simplicity. Patterns are
those  functions  with  economic  realizations;  these  may  be  called  “realization  patterns.” 
There  may  also  be,  in  this  pragmatic  sense,  “definition  patterns.”  That  is,  a  function 
has  a  definition  pattern  if  it  is  easy  to  define,  for  example,  amenable  to  interpolation. 
It  is  not  clear  whether  or  not  these  two  concepts  of  pattern  can  be  unified;  but  for 
PT  1,  pattern  is  addressed  only  in  the  realization  pattern  sense. 

3.4.7  Figure-of-Merit  for  the  PT  1  Problem 

The figures-of-merit for the general problem included a consideration for error in the
realized function. This factor is important in real computational system design (see
[61]), but perhaps not as important as we once thought (see Section 6.6). PT 1 limits
consideration to exact realizations (as in [48, p.33]). This specialization is made at
the expense of generality to allow us to focus on the pattern issues. Although finding
the pattern in a function that allows its exact computation is a special case of finding
the patterns that allow functions to be approximately computed, it is by no means
trivial. We think the essential character of the pattern finding problem is preserved in
the PT 1 problem and made more readily accessible by setting aside the error issues.
Section 6.6 further discusses the cost-error trade-offs.

The general problem also included several measures of cost (hardware costs, run-time,
etc.). A central thesis of the Pattern Theory paradigm is that all these costs
are well represented (with respect to our pattern finding problem) in one abstract
measure, Decomposed Function Cardinality. Chapter 4 explains and supports this
thesis. Therefore, the PT 1 problem specializes to this single measure with very little
loss of generality.




3.4.8  Kinds  of  Patterns 


We  discussed  previously  that  functions  have  a  property  called  “realization  pattern- 
ness.”  That  is,  some  functions  have  simple  realizations  while  others  do  not.  Of  course, 
whether  or  not  a  function  has  a  simple  realization  is  relative  to  the  representation 
system.  Reference  [48]  develops  the  idea  that  there  are  two  fundamental  mechanisms 
that  allow  simple  realizations.  That  is,  there  are  two  kinds  of  realization  patterns. 
One  kind  is  based  on  the  relationship  between  a  function  and  physical  processes  (call 
this  kind  “physical  patterns”).  The  second  kind  is  based  solely  on  the  decomposability 
of  the  function  (call  this  kind  “decomposition  patterns”).  PT  1  is  concerned  only  with 
this  second  kind  of  realization  patterns. 

There  is  some  definite  loss  of  generality  here.  Some  functions  are  naturally  realized 
by  some  devices  and  exploiting  this  is  essential  in  some  real  problems,  for  example 
the  computation  performed  by  an  optical  lens.  However,  applications  of  general 
purpose  computers  seem  to  rely  on  decomposition  patterns.  Functions  with  physical 
realizations  may  also  have  decomposition  pattern-ness  (e.g.  addition  can  be  realized 
physically  by  many  means  and  addition  also  has  high  decomposition  pattern-ness). 
There  seems  to  be  a  problem  here  for  the  physicists  to  worry  about.  Why  does  the 
natural  world  have  such  a  high  degree  of  decomposition  pattern-ness? 

3.4.9  Problem  Statement 

Recall  that  the  objective  is  to  isolate  that  part  of  traditional  algorithm  design  that 
depends  on  this  special  character  of  patterns  that  we  have  discussed. 

Through  the  specializations  discussed  above,  we  arrive  at  the  PT  1  problem. 
That  is,  given  a  single  finite  binary  function  completely  defined  as  a  table,  find  an 
exact  combinational  realization  that  minimizes  the  Decomposed  Function  Cardinality 
measure. 

We have gone from a very general, but somewhat vague statement of computational
system design to a less general but definite statement that retains the essential
pattern  finding  problem. 


3.5  Summary 

This chapter explained the need for an algorithm design theory and defined the Pattern
Theory paradigm as a potential approach.

The  need  for  an  algorithm  design  theory  was  shown  by  starting  with  the  obvious 
need  for  offensive  avionics,  showing  how  important  computing  power  is  in  offensive 
avionics,  showing  how  important  algorithm  design  is  in  computing  power  and  then 
finally showing how important an engineering theory is to design. Although this development
could have been based on many different kinds of problems, the importance
of  an  algorithm  design  theory  to  offensive  avionics  is  sufficient  to  justify  our  research. 

The  Pattern  Theory  paradigm  was  defined  by  first  discussing  the  form  of  a  problem 
definition  for  an  engineering  theory.  We  then  stated  the  most  general  problem  that 




might be characterized as “computational system design.” Finally, we made explicit
specializations to this general problem to arrive at the PT 1 problem, i.e. the problem
of designing exact combinational realizations of binary functions, given as a table, such
that the Decomposed Function Cardinality (DFC) is minimized. The PT 1 problem
is much simpler than the general problem; yet it retains the essential pattern finding
problem that is our focus.




Chapter  4 


Decomposed  Function  Cardinality 
as  a  Measure  of  Pattern-ness 

4.1  Introduction 

The central thesis of Pattern Theory is that function decomposition is a way to get at
the essential idea of computational complexity. The connection between decomposition
and computation has come up before. The “divide and conquer” principle (e.g. [7,
p.3]) is essentially a suggestion that decomposition is a good idea for algorithm design.
The “chunking” model of learning (see [43]) is a form of function decomposition.
The Abductive Reasoning paradigm represents functions by compositions of simpler
functions. Function decomposition is also a generalization of representations that
use arithmetic or logical operators. When we were first exposed to pattern recognition
and machine learning, we were impressed with the prominent role of arithmetic
operators, e.g. “. . . the adaptive linear combiner, the critical component of virtually
all practical adaptive systems.”^1 We asked ourselves, “What makes arithmetic so
special?” Why should it have any special powers? We now believe the answer lies
in the fact that arithmetic operators are members of a common class of decompositions.
However, there are other classes and that is why it is important to understand
decomposition as the underlying principle with linear combiners as a special case.

We propose Decomposed Function Cardinality (DFC) as a quantification of a
function’s pattern-ness. Section 4.2 contains a formal definition of DFC. Informally,
we base our measure on the cardinality of a function. A function is a set of ordered
pairs and, as with any set, a function has some cardinality. That is, for finite functions,
a function has some number of elements. Function h of Table 4.1 has cardinality 8
while functions f and g of Table 4.2 have cardinality 16.

Now  we  need  to  distinguish  between  functions  of  the  same  cardinality  that  have 
different  pattern-ness.  First  recognize  that  some  functions  can  be  represented  as 
a composition of smaller functions. For example, f in Table 4.2 can be written

^1 From the UCLA Adaptive Neural Network and Adaptive Filters Course announcement, Bernard
Widrow and Mark A. Gluck Instructors, 1991.




Table 4.1: Function Cardinality of h is 8


x1 x2 x3 x4 |  f  g
 0  0  0  0 |  1  0
 0  0  0  1 |  0  0
 0  0  1  0 |  0  1
 0  0  1  1 |  0  0
 0  1  0  0 |  1  1
 0  1  0  1 |  0  1
 0  1  1  0 |  0  0
 0  1  1  1 |  0  0
 1  0  0  0 |  1  1
 1  0  0  1 |  0  0
 1  0  1  0 |  0  0
 1  0  1  1 |  0  0
 1  1  0  0 |  0  1
 1  1  0  1 |  0  0
 1  1  1  0 |  1  1
 1  1  1  1 |  0  0

Table 4.2: Functions f and g

Figure 4.1: f as a Composition of Smaller Functions


z1 z2 | F(z1,z2)  φ(z1,z2)  ψ(z1,z2)
 0  0 |    1         0         0
 0  1 |    0         1         0
 1  0 |    0         1         0
 1  1 |    0         0         1

Table 4.3: Functions that Compose f


$f(x_1, x_2, x_3, x_4) = F(\phi(\psi(x_1, x_2), x_3), x_4)$. This representation of $f$ can be diagrammed
as in Figure 4.1 with $F$, $\phi$ and $\psi$ defined in Table 4.3. Notice that the cardinality of
$F$, $\phi$ and $\psi$ is 4 each. The sum of their cardinalities is 12. Therefore, $f$, a function
of cardinality 16, can be represented by a composition of functions whose combined
cardinality is only 12. We say that $f$ has a Decomposed Function Cardinality of 12.
Most functions, such as $g$ in Table 4.2, cannot be composed from smaller functions
in this way. Some functions have more than one decomposition; a decomposition is a
representation of a function as a composition of smaller functions. When a function
has more than one decomposition, DFC is defined to be the minimum combined
component cardinality of all the decompositions of that function. The familiar decomposition
of addition would look like Figure 4.2 with components as illustrated in
Tables 4.5 and 4.6. The cardinality of $a_1$ and $c_1$ is 4 each, the cardinality of $a_2$, $a_3$,
$c_2$ and $c_3$ is 8 each; therefore, the DFC of adding two numbers of three bits each is
4 + 4 + 8 + 8 + 8 + 8 = 40.
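
The decomposition of f can be checked mechanically. The sketch below is only an illustration: it enters the three component tables as they read in Table 4.3 (F as NOR, φ as XOR, ψ as AND), composes them, and compares the combined component cardinality with the cardinality of the full table. The dictionary names are assumptions made for the example.

```python
from itertools import product

# Component functions as read from Table 4.3, stored as sets of ordered pairs.
F   = {(0, 0): 1, (0, 1): 0, (1, 0): 0, (1, 1): 0}   # behaves as NOR
PHI = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}   # behaves as XOR
PSI = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 1}   # behaves as AND

def f(x1, x2, x3, x4):
    """f(x1, x2, x3, x4) = F(phi(psi(x1, x2), x3), x4)."""
    return F[(PHI[(PSI[(x1, x2)], x3)], x4)]

# The undecomposed function is a table with one entry per input vector.
f_table = {x: f(*x) for x in product((0, 1), repeat=4)}

cardinality_f = len(f_table)                         # cardinality of f as a set of pairs
dfc_of_decomposition = len(F) + len(PHI) + len(PSI)  # combined component cardinality

if __name__ == "__main__":
    print(cardinality_f, dfc_of_decomposition)
    print([f_table[x] for x in product((0, 1), repeat=4)])
```

The list printed on the last line can be compared row by row with the f column of Table 4.2.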

The palindrome acceptor on six variables is a function with cardinality 64 (see Table 4.7).
Figure 4.3 is a decomposition of a palindrome acceptor with Tables 4.8 and




Figure  4.2:  Decomposition  of  Addition 


x1x2x3  x4x5x6 | y1 y2 y3 y4
 000     000   |  0  0  0  0
 000     001   |  0  0  0  1
  ⋮       ⋮    |
 010     110   |  1  0  0  0
  ⋮       ⋮    |
 111     110   |  1  1  0  1
 111     111   |  1  1  1  0

Table 4.4: Addition on Six Variables (Four Output Functions)


x3 x6 | a1 c1
 0  0 |  0  0
 0  1 |  1  0
 1  0 |  1  0
 1  1 |  0  1

Table 4.5: Addition Components a1 and c1 (XOR and AND)




c1 x2 x5 | a2 c2
 0  0  0 |  0  0
 0  0  1 |  1  0
 0  1  0 |  1  0
 0  1  1 |  0  1
 1  0  0 |  1  0
 1  0  1 |  0  1
 1  1  0 |  0  1
 1  1  1 |  1  1

Table 4.6: Addition Components a2 and c2
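
The addition decomposition can be sketched as a ripple of the components in Tables 4.5 and 4.6. The wiring below is an assumption inferred from the table headers (x3 and x6 are the low-order bits; the third stage takes c2, x1 and x4), since Figure 4.2 itself is not reproduced here. Each of the six components is a separate table, so a sum and its carry are counted separately.

```python
from itertools import product

def a1(x3, x6): return x3 ^ x6                    # Table 4.5, sum bit (XOR)
def c1(x3, x6): return x3 & x6                    # Table 4.5, carry bit (AND)
def a_next(c, x, y): return c ^ x ^ y             # Table 4.6 pattern, sum bit
def c_next(c, x, y): return (c & x) | (c & y) | (x & y)   # Table 4.6 pattern, carry bit

def add3(x1, x2, x3, x4, x5, x6):
    """Add x1x2x3 and x4x5x6 (most significant bit first); returns (y1, y2, y3, y4)."""
    s1, k1 = a1(x3, x6), c1(x3, x6)
    s2, k2 = a_next(k1, x2, x5), c_next(k1, x2, x5)
    s3, k3 = a_next(k2, x1, x4), c_next(k2, x1, x4)
    return (k3, s3, s2, s1)

# Two components on 2 binary inputs and four components on 3 binary inputs.
dfc_addition = 2 * 2**2 + 4 * 2**3

if __name__ == "__main__":
    print(add3(0, 1, 0, 1, 1, 0))   # 010 + 110
    print(dfc_addition)
```

The undecomposed alternative is four output functions on six variables, each of cardinality 64.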


[Figure inputs: x1, x6; x2, x5; x3, x4. Output: y1.]

Figure 4.3: Decomposition of a Palindrome Acceptor

4.9 showing typical components. The DFC of the palindrome acceptor on six variables
is 4 + 4 + 4 + 8 = 20.
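
A minimal sketch of this decomposition, assuming the pairing suggested by the component tables (x1 with x6, x2 with x5, x3 with x4, each through a NOT XOR, followed by a three-input AND), is given below. The function names are illustrative only.

```python
from itertools import product

def xnor(x, y):
    """NOT XOR: 1 when the two bits agree (the a components)."""
    return 1 - (x ^ y)

def and3(p, q, r):
    """Three-input AND (the b component)."""
    return p & q & r

def palindrome(x1, x2, x3, x4, x5, x6):
    """Accept exactly the six-bit strings that read the same in both directions."""
    return and3(xnor(x1, x6), xnor(x2, x5), xnor(x3, x4))

# Three components on 2 binary inputs and one component on 3 binary inputs.
dfc_palindrome = 3 * 2**2 + 2**3

if __name__ == "__main__":
    accepted = [bits for bits in product((0, 1), repeat=6) if palindrome(*bits)]
    print(len(accepted), dfc_palindrome)
```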

Figure 4.4 is a decomposition of the prime number acceptor on 9 variables with
Table 4.10 showing typical components. The DFC of this particular decomposition
for the prime number acceptor for inputs between 0 and 511 is 344.

Figure 4.5 is a decomposition of the function which declares whether a pixel should
be black or white given the coordinates of that pixel for the 16 × 16 pixel image in
Figure 4.6. Tables 4.11 through 4.13 define the components of this decomposition. The
DFC  of  the  “R”  function  is  36.  Note  that  variables  a!.j  and  Xg  are  not  needed. 

All these functions have DFCs less than the cardinality of the undecomposed
function. We would consider each of these functions to be patterned. The palindrome
acceptor ($[f] = 2^6 = 64$ versus DFC = 20) and the “R” function ($[f] = 2^8 = 256$




Table  4.7:  Palindrome  Acceptor  on  Six  Variables 


x1 x6 | a1
 0  0 |  1
 0  1 |  0
 1  0 |  0
 1  1 |  1

Table 4.8: Palindrome Acceptor Component a1 (NOT XOR)


a1 a2 a3 | b
 0  0  0 | 0
 0  0  1 | 0
 0  1  0 | 0
 0  1  1 | 0
 1  0  0 | 0
 1  0  1 | 0
 1  1  0 | 0
 1  1  1 | 1

Table 4.9: Palindrome Acceptor Component b (AND)




Table  4.11:  Letter  R  Components  c  and  d 




Figure  4.5:  Decomposition  of  an  Image 


Figure  4.6:  An  Image  of  “R” 




Table 4.12: Letter R Component a


a d | b
0 0 | 1
0 1 | 0
1 0 | 0
1 1 | 0

Table 4.13: Letter R Component b




versus DFC = 36) are very patterned, while the prime number acceptor ($[f] = 2^9 = 512$
versus DFC = 344) is only slightly patterned.

The  components  of  a  decomposition  can  be  any  kind  of  function,  not  just  the  usual 
logical  functions  (AND,  OR,  XOR,  etc.).  Note  also  that  some  of  these  decompositions 
are  very  familiar  (e.g.  addition  or  the  palindrome  acceptor)  while  others  were  not 
previously  known  (e.g.  the  primality  test  or  the  “R”). 

We  believe  that  DFC  is  a  very  robust  measure  of  pattern-ness.  As  in  the  previous 
examples,  where  completely  different  kinds  of  patterns  are  involved,  a  small  DFC 
(compared  to  the  cardinality  of  the  undecomposed  function)  is  truly  indicative  of 
pattern-ness. 

The development of DFC began at the Air Force Institute of Technology (AFIT)
where Matthew Kabrisky stressed the fundamental importance of features in recognizer
design. DFC resulted from an attempt to generalize the geometric pattern
recognition concept of “features.” We realized that features are just intermediate
stages in the data flow that promote a computationally efficient realization of a function.
Function decomposition also produces these intermediate stages of data for
exactly the same purpose. DFC allows for an explanation of pattern-ness that covers
the common pattern recognition paradigms, i.e. geometric, syntactic and artificial
intelligence approaches [4°.]. It seems that the one common factor in all these approaches
was this decomposition idea. The idea gained further credence when we
realized that the “Divide and Conquer” approach to algorithm design is essentially a
function decomposition approach. We were eventually directed^2 to some theoretical
developments of function decomposition in the Switching Theory literature. Here
again is this idea in yet another context.

In  addition  to  these  informal  arguments  for  the  robustness  of  DFC  as  a  measure 
of  pattern-ness  there  are  several  more  objective  supporting  arguments.  Chapter  6 
empirically  relates  DFC  to  factors  in  data  compression  and  to  complexity  as  rated 
by  people.  The  remainder  of  this  chapter  will  report  on  the  relationship  between 
DFC  and  three  of  the  most  common  measures  of  computational  complexity.  First 
we  consider  program  length.  By  treating  DFC  as  a  component  of  program  length  we 
can  apply  some  useful  results  from  Information  Theory  (see  [12,  23]).  We  then  relate 
DFC  to  time  complexity.  Under  reasonable  assumptions  one  can  prove  that  a  small 
time  complexity  implies  a  small  DFC  (see  [54]).  Finally,  we  point  out  that  DFC  is  a 
special case of circuit size complexity. By connecting DFC to these traditional measures
(i.e. program length, time complexity and circuit size complexity) we support
the  contention  that  DFC  is  a  reflection  of  a  very  general  property,  as  we  would  hope 
would  be  the  case  for  a  measure  of  pattern-ness. 

^2 By Frank Brown of AFIT.




4.2  Decomposed  Function  Cardinality 

We need a more formal notation for the representation of a function’s decomposition
to define Decomposed Function Cardinality (DFC). For $f : X_1 \times X_2 \times \cdots \times X_n \to Y$, a
representation $r_f$ of $f$ is a finite, acyclic, directed graph $G$ and a set of finite functions
$P$. That is, $r_f = (G, P)$. $G$ and $P$ are defined below.

It will simplify the notation if we rename some sets. We will use $U_i$ to represent
input ($X_i$), intermediate ($Z_k$), and output ($Y$) sets. In particular, let

\[
U_i =
\begin{cases}
X_i     & i = 1, 2, \ldots, n \\
Z_{i-n} & i = n+1, n+2, \ldots, n+[P]-1 \\
Y       & i = n+[P].
\end{cases}
\]

In this notation,

\[ f : U_1 \times U_2 \times \cdots \times U_n \to U_{n+[P]}, \]

and $U_{n+1}, U_{n+2}, \ldots, U_{n+[P]-1}$ are the intermediate (“feature”) sets that are created as
a result of the decomposition.

The graph ($G$) consists of a set of vertices ($V$) and a set of arcs ($A$). That is,
$G = (V, A)$. There is a vertex in the graph for each variable in the representation,
i.e. $V = \{u_1, u_2, \ldots, u_{n+[P]}\}$. The lower case $u_i$ is a variable for the set $U_i$. $A$ is a
subset of $V \times V$, called a set of arcs. Indegree of $u_i$ is 0 for $i = 1, 2, \ldots, n$; outdegree
of $u_{n+[P]}$ is 0; indegree of $u_{n+[P]}$ is 1; indegree and outdegree of $u_i$ are not 0 for
$i = n+1, n+2, \ldots, n+[P]-1$.

$P = \{p_1, p_2, \ldots, p_{[P]}\}$ is a set of nonempty and nonconstant functions of the following
form:

\[ p_j : \prod_{i \in I_j} U_i \to U_j, \qquad j = n+1, n+2, \ldots, n+[P], \]

where $I_j$ is a set of the indices of the input variables for $p_j$. That is, with $I$ the set of
positive integers,

\[ I_j = \{\, i \in I \mid (u_i, u_j) \in A \,\}. \]

$A$ and $V$ are such that $(u_i, u_j) \in A$ if and only if there exists a $p_j \in P$ which has
$i \in I_j$. Let $n_j$ be the number of variables input to $p_j$. That is, $n_j = [I_j]$.

In summary,

\[ r_f = (G, P), \]
\[ G = (V, A), \]
\[ V = \{u_1, u_2, \ldots, u_{n+[P]}\}, \]
\[ A \subseteq V \times V, \]
\[ P = \{p_1, p_2, \ldots, p_{[P]}\}, \]

and

\[ p_j : \prod_{i \in I_j} U_i \to U_j. \]



DFC (c.f. [48, pp.37-48]^3) is used with two meanings. First, it is used to denote
the DFC of a particular representation of a function,

\[ DFC(r_f) = \sum_{j=1}^{[P]} \Bigl( \prod_{i \in I_j} [U_i] \Bigr). \]

When $U_j = \{0, 1\}$ for all $j$, this becomes:

\[ DFC(r_f) = \sum_{j=1}^{[P]} 2^{n_j}. \]

We are occasionally interested in using the actual cardinality $[p_j]$ of the component
functions rather than $2^{n_j}$. We call this Decomposed Partial Function Cardinality
(DPFC),

\[ DPFC(r_f) = \sum_{j=1}^{[P]} [p_j]. \]

When all the component functions of a representation are total the two measures
DFC and DPFC are the same.

When we talk about the DFC of a function, we mean the DFC of that function’s
optimal representation. If $\mathcal{R}$ is the set of all possible representations then

\[ DFC(f) = \min_{r_f \in \mathcal{R}} DFC(r_f). \]

The DPFC of a function is similarly defined. We also use simply DFC (or DPFC)
when the particular function or representation is made clear by the context.
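
These definitions translate directly into a few lines of code. The sketch below is a hedged illustration, not the report's implementation: a representation is reduced to its index sets I_j, the set sizes [U_i] are supplied as a dictionary, and the minimum is taken only over an explicitly supplied list of candidate representations rather than over all possible representations.

```python
from math import prod

def dfc(index_sets, set_sizes):
    """DFC(r_f): sum over components j of the product of [U_i] for i in I_j."""
    return sum(prod(set_sizes[i] for i in I_j) for I_j in index_sets.values())

def dpfc(component_tables):
    """DPFC(r_f): sum of the actual cardinalities [p_j] of the component tables."""
    return sum(len(table) for table in component_tables)

def dfc_over_candidates(candidates, set_sizes):
    """Minimum DFC over a given list of candidate representations (not all of them)."""
    return min(dfc(index_sets, set_sizes) for index_sets in candidates)

# All sets binary: [U_i] = 2 for the seven variables of the four-variable example.
SIZES = {i: 2 for i in range(1, 8)}
DECOMPOSED = {5: {1, 2}, 6: {5, 3}, 7: {6, 4}}   # three two-input components
UNDECOMPOSED = {5: {1, 2, 3, 4}}                  # one four-input component

if __name__ == "__main__":
    print(dfc(DECOMPOSED, SIZES), dfc(UNDECOMPOSED, SIZES))
    print(dfc_over_candidates([DECOMPOSED, UNDECOMPOSED], SIZES))
    # A partial component defined on only 3 of its 4 input combinations.
    partial = {(0, 0): 0, (0, 1): 1, (1, 1): 0}
    total = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 1}
    print(dpfc([partial, total]))
```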


4.3  Decompositions  Encoded  as  Programs 

4.3.1  Introduction 

We want to relate DFC to program length as developed in Appendix A. In particular,
we want to formalize how representations of decompositions are special cases
of programs. The cost function on representations that then corresponds to program
length is of special interest since we know a number of properties of program
length.  We  must  also  address  the  concern  that  a  decomposition  of  a  function  consists 
of  both  component  functions  and  their  interconnections  while  DFC  only  measures 
the  complexity  of  the  component  functions.  After  all,  it  could  be  the  case  that  the 
complexity  of  the  interconnections  is  independently  important  in  the  true  measure  of 
pattern-ness. 

We will be concerned with a binary function $f : \{0,1\}^n \to \{0,1\}$ and representations
of such functions denoted $r_f$. We will also be interested in the optimal representation

^3 In the notation of [48], DFC is exactly fc{Ff-opi) when to(r) = 1 for all r in R' and R' is the
set of all Boolean functions of the form $f : \{0,1\}^n \to \{0,1\}^m$, m and n positive integers.




of a given function $f$ which will be denoted $R_f$. Let $A^*$ denote the set of all finite
strings from alphabet $A$. Define an encoding procedure $e : \mathcal{R} \to \{0,1\}^*$, where $\mathcal{R}$ is
the set of all possible representations. We prove that $e(r_f)$ is a program in the formal
sense of Appendix A. We define a cost function on encodings as simply their string
length, $L(e(r_f))$. $L$ is like a combinational form of Kolmogorov complexity and has
many of its properties (see [33] and [54, Section 5.7]). We use the relationships of
Appendix A to prove a number of properties of $e$. In particular we prove that for all
$f$

\[ 2(n+1) \le L(e(R_f)) \le 2^n + 1, \]

there exists a function $f$ such that

\[ L(e(R_f)) = 2^n + 1, \]

and that

\[ 2^n < \mathrm{average}\,(L(e(R_f))) < 2^n + 1. \]

With respect to the concern that DFC does not reflect the interconnection complexity,
we prove that if DFC is small then $L(e(R_f))$ (which does reflect interconnection
complexity) is also small.

4.3.2  Encoding  Procedure 

We now define a method for encoding decompositions into binary strings. The encoding
procedure produces a binary string $e(r_f)$ which is an identifier of a unique
function of the form $f : \{0,1\}^n \to \{0,1\}$. The procedure for generating encodings is
defined on $\mathcal{R}'$, a subset of $\mathcal{R}$. In particular, using the notation of Section 4.2, $\mathcal{R}'$ is
the subset of $\mathcal{R}$ such that:

1. $[P] < 2^n$,

2. $[p_i] < 2^n$ for $i = 1, 2, \ldots, [P]$,

3. $[A] < 2^n$.

We will prove that although $\mathcal{R}'$ is a proper subset of $\mathcal{R}$, $\mathcal{R}'$ includes all optimal
representations. We use the notation $\lceil a \rceil$ to specify the smallest integer that is greater
than or equal to the real number $a$. Similarly, $\lfloor a \rfloor$ is the largest integer less than or
equal to $a$. There is one other relationship which is a direct consequence of 1 above.

Theorem 4.1 For any $r_f$ in $\mathcal{R}'$ an arc in $A$ can be specified with $n^2$ bits for $n \ge 4$.

Proof:

By definition of $V$, $[V] \le (n + [P])$. Thus, a $v_i$ in $V$ can be specified with $\lceil \log(n + [P]) \rceil$
bits (log to the base 2). By constraint 1 on $\mathcal{R}'$, $\lceil \log(n + [P]) \rceil \le \lceil \log(n + 2^n) \rceil \le
\lceil n \log(n) \rceil$. Finally, since an arc can be defined by its head and tail vertices, an arc
can be specified with $2\lceil n \log(n) \rceil$ bits, which is less than or equal to $n^2$ for $n \ge 4$.

□ 
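
The two inequalities used in the proof are easy to spot-check numerically. The sketch below does so for a range of n; it is a sanity check under floating-point arithmetic, not a proof, and the function names are illustrative.

```python
from math import ceil, log2

def vertex_bits(n: int, num_components: int) -> int:
    """Bits needed to name one of the n + [P] vertices."""
    return ceil(log2(n + num_components))

def vertex_bits_bound(n: int) -> int:
    """The bound ceil(n log2 n) used in the proof."""
    return ceil(n * log2(n))

def arc_bits_bound(n: int) -> int:
    """An arc is a pair of vertices: twice the per-vertex bound."""
    return 2 * vertex_bits_bound(n)

if __name__ == "__main__":
    for n in range(3, 9):
        # Largest [P] allowed by constraint 1 is 2**n - 1.
        print(n, vertex_bits(n, 2**n - 1), vertex_bits_bound(n), arc_bits_bound(n), n * n)
```

In this check the bound is tight at n = 4 (16 bits against 16) and fails at n = 3 (10 bits against 9), which matches the threshold in the theorem.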




The  objective  of  the  following  encoding  scheme  is  to  encode  any  reasonable  r/ 
into  as  short  a  binary  string  as  possible  while  keeping  a  manageable  expression  for 
i(e(r/)).  We  assume  that  n  is  known.  If  n  is  not  known,  a  unary  representation  of  n 
using  n  +  1  bits  could  be  added  to  the  front  of  the  encoding.  The  encoding-  assumes 
that  all  functions  involved  are  total.  If  a  partial  function  is  involved,  it  can  be  made 
total  by  arbitrarily  assigning  ail  Don’t-Cares  to  be  0.  The  encoding  procedure  is  as 
follows; 

1. The first bit of e(r_f) indicates whether or not the function is decomposed.
When this bit is 0, the function is not decomposed and the rest of the program
lists, in the order of the domain of f, all the images of f. When this bit is 1
the rest of the encoding is as follows.

2. The next n bits specify [P], which is possible by the first constraint on R'.

3. If [P] is zero then f is either a constant or a projection of one of its variables.
When [P] is zero the next n + 1 bits of the encoding indicate which constant
or projection function. If f is a projection of the i-th variable then the i-th of
the first n bits is one and the others are zero. If f is a constant, which will
be indicated by the first n bits being all zero, then the (n + 1)-th bit indicates
which constant function is f. The total encoding of any constant or projection
function therefore requires 2(n + 1) bits. If f is not a constant or projection
function then [P] is not zero and the encoding proceeds as follows.

4. The encoding repeats the following for i = 1, ..., [P]:

a. The first n bits specify [p_i], which is possible by the second constraint on R'.

b. The next [p_i] bits specify p_i.

5. The next n bits of the program specify [A], which is possible by the third
constraint on R'.

6. The encoding then repeats the following for i = 1, ..., [A]: the i-th arc is specified
by n^2 bits, which is possible by Theorem 4.1.
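
As an illustration of steps 1 through 3, here is a minimal sketch of the two simplest cases: an undecomposed function and a constant or projection function. The function names are hypothetical, and the value of the final bit for a projection is an assumption, since the procedure does not say what that bit holds in that case.

```python
def encode_undecomposed(truth_table):
    """Step 1 with flag 0: the flag bit, then the images of f in domain order."""
    return "0" + "".join(str(bit) for bit in truth_table)

def encode_projection(n, i):
    """Steps 1-3 for the projection onto variable i (1-based)."""
    flag = "1"                      # step 1: the function is "decomposed"
    num_components = "0" * n        # step 2: [P] = 0 written in n bits
    one_hot = "".join("1" if k == i else "0" for k in range(1, n + 1))
    unused = "0"                    # assumption: the (n + 1)-th bit is 0 here
    return flag + num_components + one_hot + unused

def encode_constant(n, value):
    """Steps 1-3 for the constant function with the given value (0 or 1)."""
    return "1" + "0" * n + "0" * n + str(value)

if __name__ == "__main__":
    print(encode_projection(4, 2))   # 10 bits = 2(n + 1) for n = 4
    print(encode_constant(3, 1))     # 8 bits = 2(n + 1) for n = 3
    print(encode_undecomposed([0, 1, 1, 0]))
```

Both trivial encodings come out at 2(n + 1) bits, and the undecomposed encoding at 1 + 2^n bits, matching the counts stated in the procedure.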

4.3.3  Length  of  an  Encoding 

We  now  develop  the  expression  for  the  length  of  the  string  which  results  from  the 
encoding  procedure. 

Theorem 4.2 If [P] = 0 then L(e(r_f)) = 2(n + 1); if [P] = 1 then L(e(r_f)) ≤ 1 + 2^n;
otherwise L(e(r_f)) = 1 + 2n + n^2[A] + n[P] + DFC(f).




Proof: 

When [P] = 0 there is the bit of step 1, the n bits of step 2 and the n + 1 bits of step
3. When [P] = 1, the worst case is when there are no vacuous variables and then
there is the bit of step 1 and the 2^n bits of f. Otherwise: Step 1 uses 1 bit. Step 2
uses n bits. Step 4 uses a sum as i = 1, 2, ..., [P] of the bits required for part a) plus
the bits required for part b). Part a) requires n bits. Part b) requires [p_i] bits. The
total number of bits required for Step 4 then is:

\sum_{i=1}^{[P]} (n + [p_i]) = n[P] + \sum_{i=1}^{[P]} [p_i] = n[P] + DFC(f).

Step 5 uses n bits. Step 6 uses a sum as i = 1 to [A] of the bits required for an arc.
An arc requires n^2 bits, assuming n ≥ 4. The total number of bits required for Step
6 then is:

\sum_{i=1}^{[A]} n^2 = n^2[A].

Therefore,

L(e(r_f)) = 1 + n + n[P] + DFC(f) + n + n^2[A] = 1 + 2n + n^2[A] + n[P] + DFC(f).
□
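
The bookkeeping in this proof can be mirrored step by step. The sketch below assumes the general case (more than one component) and takes the component cardinalities [p_i] and the arc count [A] as plain inputs; the function names and the sample values are illustrative only.

```python
def encoding_length(n, component_sizes, num_arcs):
    """Add up the bits used by steps 1, 2, 4, 5 and 6 of the encoding."""
    step1 = 1                                            # decomposed flag
    step2 = n                                            # [P] in n bits
    step4 = sum(n + size for size in component_sizes)    # [p_i], then p_i
    step5 = n                                            # [A] in n bits
    step6 = num_arcs * n * n                             # n^2 bits per arc
    return step1 + step2 + step4 + step5 + step6

def closed_form(n, component_sizes, num_arcs):
    """1 + 2n + n^2 [A] + n [P] + DFC(f), with DFC(f) the sum of the [p_i]."""
    dfc = sum(component_sizes)
    return 1 + 2 * n + n * n * num_arcs + n * len(component_sizes) + dfc

if __name__ == "__main__":
    # Hypothetical decomposition: n = 4, three components of cardinality 4, 7 arcs.
    print(encoding_length(4, [4, 4, 4], 7), closed_form(4, [4, 4, 4], 7))
```

For these sample values both routes give 145 bits, of which 112 come from the arcs and only 12 from the component tables themselves.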


4.3.4 R' Includes All Optimal Representations

As mentioned earlier, R' is a proper subset of R; for example, we could define an r_f
with arbitrarily many identity functions so [P] would be larger than that allowed for
encoding. However, R' includes most reasonable representations and we prove that R'
includes all “optimal” representations. For every f, define r_{f-opt} as a representation
of f such that L(e(r_{f-opt})) ≤ L(e(r_f)) for all r_f that represent f. In order to prove
that all optimal representations are in R' we need several results which we develop
now.

Theorem 4.3 For n ≥ 3, 2(n + 1) ≤ L(e(r_{f-opt})) ≤ 1 + 2^n.

Proof:

The right inequality follows since an arbitrary f has a representation with [P] = 1
(i.e. p = f) and whose length is 1 + 2^n. The left inequality follows immediately
if the representation is not a decomposition. If it is a decomposition then either L(e(r_f)) =
1 + 2n + n^2[A] + n[P] + DFC(f) ≥ 2(n + 1) or L(e(r_f)) = 2(n + 1).

□ 


The DFC of a total function cannot be greater than its cardinality.

Theorem 4.4 DFC(f) ≤ 2^n.




Proof: 

From Theorem 4.3, L(e(r_{f-opt})) ≤ 1 + 2^n. By Theorem 4.2, 1 + 2n + n^2[A] + n[P] +
DFC(f) ≤ 1 + 2^n or 2n + n^2[A] + n[P] + DFC(f) ≤ 2^n. Since all the terms on the
left are positive, each term is less than or equal to 2^n. In particular, DFC(f) ≤ 2^n.
□


No  component  of  a  decomposition  can  be  larger  than  the  whole  DFC. 

Theorem 4.5 [p_i] ≤ 2^n for i = 1, 2, ..., [P].

Proof:

From Theorem 4.4, DFC(f) ≤ 2^n. Since all the [p_i] terms that sum to DFC(f) are
positive, each term must be less than or equal to 2^n.

□ 


The  number  of  components  cannot  exceed  the  DFC. 

Theorem 4.6 [P] ≤ 2^n.

Proof:

n[P] is a term in L(e(r_{f-opt})) by Theorem 4.2 and must be less than or equal to 2^n
by Theorem 4.3. Finally, since n ≥ 1 we have [P] ≤ 2^n.

□ 


The  number  of  arcs  cannot  exceed  the  DFC. 

Theorem 4.7 [A] ≤ 2^n.

Proof: 

Follows  as  in  Theorem  4.6. 

□ 


Finally we can prove that R' includes all optimal representations.

Theorem 4.8 For all f of the form f : {0,1}^n → {0,1}, r_{f-opt} ∈ R'.

Proof:

Follows from Theorems 4.5, 4.6, and 4.7 and the definition of R'.

□


4.3.5  Properties  of  Encodings 

We  now  can  relate  the  encoded  representations  to  the  programs  of  Appendix  A. 

Theorem 4.9 e(r_f) is a program.




Proof: 

The set of e(r_f) for all r_f ∈ R' is a language P satisfying the prefix condition. For
F as the set of all functions of the form f : {0,1}^n → {0,1} define M : P → F such
that M(e(r_f)) = f. Under these conditions, (P, F, M) form a programmable machine
as defined in Appendix A.

□ 


Because e(r_f) is a program, all the results about programs apply to e(r_f). We
want to highlight a few of these results in the present context.

Theorem 4.10 There exists a function f such that L(e(r_{f-opt})) = 1 + 2^n.

Proof:

Suppose to the contrary that no such function existed. That is, the worst cost of any
function is 2^n or less. We know there exist functions with cost strictly less than 2^n,
e.g. a constant function. In this situation the average L(e(r_f)) is strictly less than
2^n; that is, we have an average of a set of finite numbers containing some numbers
less than 2^n but containing no numbers greater than 2^n. However, since e(r_f) is a
program the average L(e(r_f)) being strictly less than 2^n contradicts Theorem A.16.
Therefore the supposition is false and the theorem is proven.

□ 


Theorem 4.11 2^n ≤ average L(e(r_{f-opt})) < 1 + 2^n.

Proof:

The left inequality follows from Corollary A.3. For the right inequality, we know that
there exist functions with cost strictly less than 2^n + 1, e.g. a constant function, and
that there are no functions with cost greater than 2^n + 1 by Theorem 4.3. Therefore
the average L(e(r_f)) is strictly less than 2^n + 1.

□ 


4.3.6  Decomposed  Function  Cardinality  and  Program  Length 

We are concerned with the question: “How well does DFC(f) capture the essential
complexity of a function?” In Appendix A we developed the idea of program length
as a very general characterization of size complexity. In this section we found that
e(r_f) is a program with length L(e(r_f)) = 1 + 2n + n^2[A] + n[P] + DFC(f). Therefore,
one step in relating DFC(f) to general complexity is to assess its role in L(e(r_f)).

In those cases where L(e(r_{f-opt})) = 1 + 2^n, DFC(f) = [f] = 2^n = L(e(r_{f-opt})) - 1.
That is, in most cases DFC(f) is almost exactly L(e(r_{f-opt})). When L(e(r_{f-opt})) <
1 + 2^n, we do not have as simple a relationship. However, we can prove that DFC(f)
is roughly of the same order of complexity as L(e(r_{f-opt})).




We already know that L(e(r_{f-opt})) cannot be small unless DFC(f) is small since
DFC(f) is a part of L(e(r_{f-opt})). Our concern is that DFC(f) might be small while
L(e(r_{f-opt})) is large. That is, we do not want to think that a function is patterned
based on DFC while it really is not patterned when you consider the full cost of
representation as measured by L(e(r_{f-opt})). In order to demonstrate that this is not
a problem, we will show that L(e(r_{f-opt})) ≤ n^3 DFC(f). That is, L(e(r_{f-opt})) is never
of a much higher order than DFC(f).⁴

Theorem 4.12 DFC(f) ≤ L(e(r_{f-opt})) ≤ n^3 DFC(f) for all n ≥ 4 and
DFC(f) ≥ 1.

Proof:

The left inequality follows immediately. For the right, let n_i be the number of input
variables for p_i. Then DFC(f) = \sum_{i=1}^{[P]} 2^{n_i}. Since n_i ≥ 2 we have DFC(f) ≥ 4[P].
Also, [A] = \sum_{i=1}^{[P]} n_i + 1 and, since n_i ≤ n, we have [A] ≤ n[P] + 1. Now use the second
inequality to eliminate [A] from the expression for L(e(r_{f-opt})) to get:

L(e(r_{f-opt})) ≤ 1 + 2n + n^2(n[P] + 1) + n[P] + DFC(f)

and then use the first inequality to eliminate [P] to get:

L(e(r_{f-opt})) ≤ 1 + 2n + n^2((n/4)DFC(f) + 1) + (n/4)DFC(f) + DFC(f)

or equivalently:

L(e(r_{f-opt})) ≤ 1 + 2n + n^2 + (1/4)DFC(f)(n^3 + n + 4)

This simplifies to L(e(r_{f-opt})) ≤ n^3 DFC(f) for n ≥ 4 and DFC(f) ≥ 1.

□ 
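
The final simplification in this proof can be checked numerically. This is a minimal sketch, assuming the last intermediate bound reads 1 + 2n + n^2 + (1/4)DFC(f)(n^3 + n + 4) and the target is n^3 DFC(f); exact rational arithmetic avoids rounding in the division by 4. The function names are illustrative.

```python
from fractions import Fraction

def intermediate_bound(n, dfc):
    """1 + 2n + n^2 + (1/4) * DFC * (n^3 + n + 4), evaluated exactly."""
    return 1 + 2 * n + n * n + Fraction(dfc, 4) * (n ** 3 + n + 4)

def final_bound_holds(n, dfc):
    """True when the intermediate bound is at most n^3 * DFC."""
    return intermediate_bound(n, dfc) <= n ** 3 * dfc

if __name__ == "__main__":
    for n, dfc in [(4, 1), (4, 16), (8, 28), (16, 1000)]:
        print(n, dfc, intermediate_bound(n, dfc), n ** 3 * dfc)
```

The slack grows with both n and DFC(f), so the tightest case in the stated range is n = 4 with DFC(f) = 1, where the intermediate bound is 43 against a target of 64.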


From Theorem 4.12 we know that if DFC(f) is small then program length is
also small. For example, if DFC(f) is polynomial in n then program length is also
polynomial in n. The point is that DFC(f) reflects the essential complexity even
though it does not directly include a measure of the interconnection complexity of
the representation.

There is another indication of the relative importance of interconnection complexity
versus DFC. If the interconnections are minimized without regard to DFC then
the result is n + 1 interconnections (n the number of non-vacuous variables). That is,
all functions on a given number of variables have the same minimum interconnection
complexity.

[54]  Theorem  5.7.5. 




4.4 Decomposed Function Cardinality and Time Complexity

Another concern about the DFC(f) measure of pattern-ness is that it is based on
combinational machines. What if there are patterns that only become measurable
with respect to sequential machines? If this were the case then DFC(f) would not
measure the patterns of interest in most real computing problems. Of course, this is
not the case. In order to demonstrate that DFC(f) will be small whenever there exists
a good sequential algorithm, we relate DFC(f) to time complexity (the traditional
measure of Computational Complexity Theory). There are several differences in the
two perspectives (time complexity versus DFC(f)) that must be considered. One
difference is that DFC(f) considers complexity in terms of a single measure of a
whole finite function while traditional complexity is in terms of a particular input
to an infinite function. A second difference is that Pattern Theory is based on size
complexity while traditional measures are based on time complexity. The objective
of this section is to demonstrate that size and time complexity are simply different
perspectives of what might be considered the essential computational complexity of
a function.

Why not just use time complexity in the first place? First and foremost, the time
complexity of all finite functions is O(1). Therefore, time complexity does not
differentiate between patterned and un-patterned finite functions. Another problem with
time complexity is finding out what it is. We know how to find the time complexity
of an algorithm. But we do not know whether or not that is the time complexity of
the function. It may simply be a poor algorithm. This also presupposes that you
have an algorithm for the function, which is begging our question.

Once the relationship between DFC(f) and traditional time complexity is
established we are able to apply results from the traditional theory of computational
complexity. Also, because small time complexity implies small DFC(f) we know
that patterns as measured by time complexity are a subset of patterns as measured
by DFC(f).

Traditional time complexity t(n) is defined in terms of the run-time of a Turing
Machine [59] which realizes a function of the form f : {0,1}* → {0,1}, when the
input is a string of length n. On the other hand, DFC(f) is defined as a measure
of a realization of a function of the form g : {0,1}^n → {0,1}. In order to be able to
compare these two measures we use the following device. Let f : {0,1}* → {0,1} be
a function and f_n : {0,1}^n → {0,1} for n = 0, 1, 2, ... be a sequence of functions such
that f_n(x) = f(x) for all x in {0,1}^n and for all n in N. The DFC complexity of a
representation of f_n is denoted s(n).
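
The device of restricting a function on arbitrary-length strings to a family of fixed-length functions can be sketched as follows. The names `restrict` and `parity` are illustrative; parity is used only as a convenient example of a function defined on strings of every length.

```python
from itertools import product

def restrict(f, n):
    """Truth table of f_n: f applied to every binary string of length n,
    listed in lexicographic order of the inputs."""
    return [f("".join(bits)) for bits in product("01", repeat=n)]

def parity(s):
    """Example function on {0,1}*: 1 when s has an odd number of ones."""
    return s.count("1") % 2

if __name__ == "__main__":
    for n in range(4):
        print(n, restrict(parity, n))
```

Each f_n is a finite function with 2^n table entries, so a cardinality-based measure applies to every member of the family even though f itself has an infinite domain.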

We  developed  this  device  and  the  following  theorem  not  knowing  that  a  similar 
result  had  been  previously  demonstrated  (see  [19,  55]).  However,  because  this  is 
fundamentally  important  to  Pattern  Theory,  we  want  to  present  a  proof.  Rather 
than  repeating  the  original  proof,  we  present  our  proof  and  suggest  that  the  reader 
see  [64,  pp. 271-276]  for  a  more  elegant  demonstration  of  this  result. 




Theorem 4.13 If f : {0,1}* → {0,1} is a function with an algorithm of worst-case
time complexity t(n) then for every n ∈ N, there exists a combinational realization of
f_n such that s(n) is in O(t(n) log t(n)).

Proof: 

Let M = {Σ, Γ, Q, q_0, δ, F} be the Turing Machine which realizes f with worst case
time complexity t(n), where Γ is the tape alphabet, Σ is a subset of Γ called the
input alphabet, Q is a set of machine states, q_0 is the start state, F is a set of final
states, and δ : Γ × Q → Γ × Q × {-1,+1} is the transition function (reference [59,
pp. 211-215]). The L (for left) and R (for right) of Sudkamp’s definition have been
replaced with -1 and +1; the reason for this will be pointed out later. We break δ
into δ' : Γ × Q → Γ, δ'' : Γ × Q → Q, and δ''' : Γ × Q → {-1,+1} such that δ(g, q)
equals (δ'(g, q), δ''(g, q), δ'''(g, q)).

Define the interval Z = {0, 1, 2, ..., t(n)} and the function p : Z → Z such that
p(i) is the position of the tape after i transitions. Define q : Z → Q such that q(i) is
the state of the machine after i transitions. Define g : Z × Z → Γ such that g(i, j)
is the symbol in the j-th position of the tape after i transitions. We use p, q, g and h
to establish a combinational realization of f_n. Define h : Γ^{t(n)} × Z × Q → Γ^{t(n)} such
that g(i, j) = h(g(i - 1, j), p(i - 1), q(i - 1)) for all i and j.

From the starting conditions of the Turing Machine p(0) = 1, q(0) = q_0, and
g(0, j) = x(j) (i.e. the input) for j = 1, 2, ..., n. From the final conditions, p(t(n)) = 0
and q(t(n)) is the state defined to correspond to f(x) = 0 and f(x) = 1 as appropriate.

By the definition of a Turing Machine we can write difference equations for p, q,
and g:

p(i) = p(i - 1) + δ'''(g(i - 1, p(i - 1)), q(i - 1)) for i = 1, 2, ..., t(n). Using
{-1,+1} as the range of δ''', rather than {L, R}, allows this simple addition. The
increment in cost associated with each i is the cost of addition plus the cost of δ''',
i.e. s(+) + s(δ'''), since the cost of all the other functions is accounted for elsewhere.
The cost of “+” is O(m) (reference [7]) where, since the largest number to be added
is t(n), m = log t(n). That is, s(+) = k log t(n). For δ''' : Γ × Q → {-1,+1},
we have s(δ''') = [Γ][Q] log[{-1,+1}] = [Γ][Q] = k, k some constant. Therefore,
s(p) = k_1 log(t(n)) + k_2.

q(i) = δ''(g(i - 1, p(i - 1)), q(i - 1)), for i = 1, 2, ..., t(n). The increment in cost
associated with each i is s(δ''), since the cost of all the other functions is accounted
for elsewhere. For δ'' : Γ × Q → Q, we have s(δ'') = [Γ][Q] log[Q] = k, k some
constant. Therefore, s(q) = k.

g(i, j) is unchanged for all j except j = p(i - 1) and g(i, p(i - 1)) = δ'(g(i - 1, p(i -
1)), q(i - 1)). That is, g(i, j) = h(g(i - 1, j), p(i - 1), q(i - 1)) = .NOT.(.EQ.(j, p(i -
1))) × g(i - 1, j) + .EQ.(j, p(i - 1)) × δ'(g(i - 1, j), q(i - 1)), where the functions
.NOT. and .EQ. are the obvious logical operators with values 0 or 1 and + and ×
are arithmetic operators. The increment in cost associated with h is s(.NOT.) +
s(+) + s(δ') + 2s(.EQ.) + 2s(×). The multiplications always involve a zero or a
one, therefore we assume that s(×) is constant. The addition always involves a zero,
therefore we assume s(+) is constant. The cost of the logical operators is constant.




Figure 4.7: Similar Decompositions, One Recursive, One Not

For δ' : Γ × Q → Γ, we have s(δ') = [Γ][Q] log[Γ] which is again some constant.
Thus, s(h) = k, for some constant k. There must be an h for each tape position. A
maximum of t(n) tape positions will be used, therefore, we know there can be less
than or equal to t(n) h’s. The total cost then for updating g(i, j) for all j at each i
is k t(n).

Each transition can be combinationally realized with a cost of s(p) + s(q) + s(h) =
k_1 t(n) + k_2 + k_3 log(t(n)). The entire process can be combinationally realized by
modeling t(n) transitions. Inputs which require fewer than t(n) transitions can be dealt
with by modifying δ to include a no-op type transition. Therefore, the total
combinational complexity is t(n){k_1 log(t(n)) + k_2 + k_3 log(t(n))}. That is, s(n) is in
O(t(n) log(t(n))).

□ 
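
The difference equations used in this proof can be evaluated directly for a small machine. The sketch below is illustrative only: the parity machine, the integer encoding of tape symbols, the state names, and the use of a move of 0 for the no-op transition are all assumptions made here, and the sketch does not return the head to position 0 at the end. It shows the layered, loop-free structure of the construction (one copy of h per tape position per transition), not its cost.

```python
BLANK = 2  # tape symbols are the integers 0, 1 and BLANK (encoding assumed here)

def parity_delta(symbol, state):
    """Hypothetical transition function: (symbol, state) -> (write, state, move)."""
    if state in ("even", "odd"):
        if symbol == BLANK:                       # end of input: halt with answer
            return symbol, "halt_" + state, 0
        flipped = {"even": "odd", "odd": "even"}
        return symbol, (flipped[state] if symbol == 1 else state), +1
    return symbol, state, 0                       # no-op once halted

def unroll(delta, q0, x, t):
    """Evaluate the difference equations for p, q and g over t transitions."""
    width = t + 2
    g = [[BLANK] * width]
    for j, bit in enumerate(x, start=1):          # g(0, j) = x(j)
        g[0][j] = bit
    p, q = [1], [q0]                              # p(0) = 1, q(0) = q0
    for i in range(1, t + 1):
        head, state = p[i - 1], q[i - 1]
        _, next_state, move = delta(g[i - 1][head], state)
        p.append(head + move)                     # p(i) = p(i-1) + delta'''
        q.append(next_state)                      # q(i) = delta''
        row = []
        for j in range(width):                    # one copy of h per position
            eq = 1 if j == head else 0            # .EQ.(j, p(i-1))
            written = delta(g[i - 1][j], state)[0]            # delta'
            row.append((1 - eq) * g[i - 1][j] + eq * written)
        g.append(row)
    return p, q, g

if __name__ == "__main__":
    p, q, g = unroll(parity_delta, "even", [1, 0, 1, 1], 5)
    print(p, q[-1])
```

Every value at layer i depends only on values at layer i - 1, which is what allows the sequential run to be laid out as a fixed combinational network once t is known.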


An immediate consequence of this Theorem is that if it is not possible to find a
good (relative to DFC) combinational realization of a function then it is also not
possible to find a good (relative to time complexity) algorithm for the function. In
other words, if a method for finding optimal combinational realizations fails to find a
nice representation of a function then there does not exist a nice algorithm to compute
that function either.

The results in computability theory demonstrate that recursion is the key property
that a function must have to be computable. It is tempting then to extend this and
say that recursion is a key property that a function must have to be patterned (i.e. not
complex). However, DFC(f) does not favor functions with recursive representations.
For example, Figure 4.7 shows two functions of equal DFC. The function on the
left has a recursive-like representation while the function on the right does not. We
feel that the problem lies in trying to extend computability results to complexity
rather than in DFC(f). Recursion is important in getting finite representations of
infinite functions (see Appendix A); however, it is not crucial to the complexity of
finite functions. There is a real cost savings in recursively reusing components of a
composition; however, we believe that this is a secondary effect and it is not indicative
of the existence or absence of patterns. It would be easy to redefine DFC to reflect
the economy of reusing functions in a composition; for example, we could include only
the unique p_i’s in adding up DFC. However, finding this newly defined DFC would
be much more difficult. We feel that this improvement in generality is not worth the
loss in tractability.

In  summary,  it  has  been  proven  that  if  a  function  has  low  time  complexity  then 
it  also  has  low  DFC.  This  means  that  if  a  function  has  a  nice  representation  using 
sequential  constructs  (do-while,  recursion,  etc.)  then  it  also  has  a  nice  combinational 
representation.  The  main  point  is  that  there  is  no  loss  in  generality  due  to  the  PT  1 
restriction  to  combinational  machines. 


4.5 Decomposed Function Cardinality and Circuit Complexity

There has been a great deal of theoretical work on measuring and minimizing the
complexity of electronic circuits. Using Savage’76 [54] as the principal reference, we
review the several measures of complexity. “Combinational” complexity is the number
of circuit elements required to realize a function. “Formula size” is the number
of operators in an algebraic expression of the function. “Delay complexity” is the
length of the longest path from input to output in a realization of a function.
Combinational complexity is very similar to Decomposed Function Cardinality (DFC). We
will discuss their relationship first. The relationship of combinational complexity to
formula size and depth complexity is then summarized from Savage’76.

Combinational complexity in Savage’76 is defined relative to a set of basis functions
(Ω). A given Boolean function f is then realized by combinations of the elements
of Ω. The combinational complexity C_Ω(f) of a Boolean function f relative to the
basis Ω is the minimum number of elements from Ω required to realize f.

C_Ω(f) is sometimes generalized by allowing the various elements of Ω to have
different costs [52]. For example, we could define a weighting function (ω : Ω → R,
R the real numbers) on Ω. Then the generalized cost of a function (C_{Ω,ω}(f)) is the
sum of the weights of the elements in the realization that minimizes that sum. C_Ω(f)
is the special case of C_{Ω,ω}(f) where ω is the constant 1 function.

DFC is also a special case of this generalized combinational complexity. DFC is
exactly C_{Ω,ω}(f) when Ω is the set of all Boolean functions and ω(p) = [p] for all p ∈ Ω.
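
The two weightings can be compared on a single realization. The sketch below computes the cost of one given realization under each weighting; it does not search for the minimizing realization that the definitions call for. The realization (a tree of two-input components for a four-input function) and all names are illustrative assumptions.

```python
def unit_weight(arity):
    """Every component costs 1, as in plain combinational complexity."""
    return 1

def cardinality_weight(arity):
    """A component on `arity` binary inputs costs its cardinality, 2^arity."""
    return 2 ** arity

def realization_cost(component_arities, weight):
    """Sum of the weights of the components used in one realization."""
    return sum(weight(arity) for arity in component_arities)

if __name__ == "__main__":
    tree = [2, 2, 2]     # e.g. four-input parity built from three 2-input parts
    flat = [4]           # the same function left as a single 4-input table
    for name, parts in (("tree", tree), ("flat", flat)):
        print(name,
              realization_cost(parts, unit_weight),
              realization_cost(parts, cardinality_weight))
```

Under the unit weighting the undecomposed table looks cheapest (one component), while under the cardinality weighting the tree costs 12 against 16 for the flat table, which is the behavior the cardinality weighting is meant to capture.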

Note that for the most common set of basis functions, Ω = {AND, OR, NOT},
C_Ω(f) and DFC are also very similar.

Theorem 4.14 For a Boolean function f and basis set Ω = {AND, OR, NOT},
DFC(f) ≤ 4C_Ω(f) ≤ 4n DFC(f).




Proof: 

DFC ≤ 4C_Ω since each element of Ω has cardinality of 4 or less. Suppose the
decomposition of f with minimum DFC is made up of p_1, p_2, ..., p_{[P]} and the number of
variables going into p_i is n_i. In this case, DFC(f) = \sum_{i=1}^{[P]} 2^{n_i}. Now, C_Ω(f) ≤ \sum_{i=1}^{[P]} C_Ω(p_i).
Since each p_i has a sum-of-products representation, we have C_Ω(p_i) ≤ n_i 2^{n_i} [54, p. 19].
Therefore, C_Ω(f) ≤ \sum_{i=1}^{[P]} n_i 2^{n_i} ≤ n \sum_{i=1}^{[P]} 2^{n_i} = n DFC(f).

□ 


In summary, C_Ω is in some ways more general than DFC; but, it is more general
in a way that denies any absolute meaning to complexity. That is, for all functions
f there exists an Ω (namely any Ω with f as an element) such that C_Ω(f) = 1. This
relativity of complexity to a chosen basis is believed by some to be unavoidable.
The main idea of Pattern Theory is that there is some general absolute measure of
complexity in the sense of patterns.

C_{AND,OR,NOT} is quantitatively similar to DFC but it leads you to artificially
decompose (i.e. represent in some normal form) un-patterned functions. In Pattern
Theory, all functions have themselves as a normal form representation. Also, on
smaller functions it gives an artificial importance to members of Ω.

Savage’76 has little to say about the relationship of combinational complexity to
formula size and depth complexity. Depth complexity is proportional to the logarithm
of C_Ω. Also, C_Ω ≤ formula size. Other than that, not much is known about the
relationship between combinational complexity, depth complexity and formula size.


4.6  Summary 

This  chapter  defines  our  chosen  figure-of-merit  for  algorithm  design:  Decomposed 
Function  Cardinality  (DFC).  We  then  support  the  idea  that  DFC  reflects  the  essential 
pattern-ness  of  a  function.  Chapter  6  has  the  results  of  many  experiments  that 
support  this  idea.  In  this  chapter  we  supported  this  idea  by  showing  that  if  a  function 
has  an  interesting  pattern  by  most  other  measures  then  it  also  has  a  pattern  according 
to  DFC.  We  considered  three  other  measures  of  computational  complexity:  program 
length,  time  complexity  and  circuit  complexity. 

Appendix  A  interprets  a  result  from  Communications  Theory  in  terms  of  programs 
and  the  length  of  a  program  (essentially  the  number  of  characters  in  a  listing  of  a 
program).  It  turns  out  that  a  similar  interpretation  had  previously  been  made  [12]. 
We  then  related  DFC  and  program  length.  This  relation  supports  the  use  of  DFC 
rather  than  program  length  as  our  measure  of  patterns  because  DFC  is  more  tractable 
and  yet  reflects  the  essential  complexity  that  program  length  would  measure. 

We  developed  a  formal  proof  of  the  relationship  between  DFC  and  time  complexity. 
It  turns  out  that  this  relationship  had  also  been  previously  demonstrated  [54].  The 
relationship  between  DFC  and  time  complexity  demonstrates  that  DFC  is  a  more 
general  measure  of  complexity  (that  is,  anything  patterned  relative  to  time  complexity 
will also be patterned relative to DFC). DFC also has the advantage over time
complexity in that it is meaningful for finite functions and allows for a method of
design (Chapter 5).

The  relationship  between  DFC  and  circuit  complexity  is  quite  simple.  DFC  is  a 
special  case  of  the  more  general  definition  of  circuit  complexity.  We  believe  it  is  the 
special  case  that  reflects  the  essential  pattern-ness  of  a  function. 

We mentioned earlier that it would be desirable to include the cost of the
interconnections in a general measure of patterns, as in the program length of an encoding of a
decomposition. It would also be desirable for a general measure to reflect the savings
of reusing components, as in a recursive representation. However, we believe that
DFC allows you to determine the basic degree of pattern-ness without the complexity
of dealing with these secondary effects.




Chapter  5 


Function  Decomposition 


5.1  Introduction 

By a “function” we mean the traditional mathematical function; that is, a function
is an association between inputs and outputs such that for every input x there is
exactly one output f(x). Functions in general may have several inputs, e.g. f(x, y, z).
The decomposition of a function is an expression of that function in terms of a
composition of other (usually smaller) functions. For example, if f(x_1, x_2, ..., x_8) =
F[λ[θ(x_1, x_2), φ(x_3, x_4), ψ(x_5, x_6)], ζ(x_7, x_8)] then the right-hand side of the equation
is a decomposition of the function f. The process of finding a decomposition of a
function is called function decomposition.
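
To make the eight-variable example concrete, the sketch below builds a function of that shape from small components and compares the total size of the component tables with the size of the full table. The particular Boolean operations chosen for each component, and the Python names standing in for the component symbols, are hypothetical; the text leaves the components unspecified.

```python
from itertools import product

# Hypothetical component functions, chosen only for illustration.
def theta(a, b): return a & b
def phi(a, b): return a | b
def psi(a, b): return a ^ b
def zeta(a, b): return a & b
def lam(a, b, c): return (a & b) | c
def outer(a, b): return a ^ b

def f(x1, x2, x3, x4, x5, x6, x7, x8):
    """A function with the same composition structure as the example."""
    inner = lam(theta(x1, x2), phi(x3, x4), psi(x5, x6))
    return outer(inner, zeta(x7, x8))

# Number of binary inputs of each component: four 2-input functions,
# one 3-input function, and the 2-input outer function.
component_arities = [2, 2, 2, 2, 3, 2]
decomposed_cardinality = sum(2 ** k for k in component_arities)

# The undecomposed function needs one table entry per input combination.
full_table = [f(*bits) for bits in product((0, 1), repeat=8)]

if __name__ == "__main__":
    print(decomposed_cardinality, len(full_table))
```

With binary inputs the six component tables hold 28 entries in total, against 256 entries for the undecomposed table.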

Function  decomposition  is  of  practical  importance  in  the  design  of  computational 
systems.  The  realization  of  a  large  function  in  terms  of  smaller  functions  has  a  number 
of  practical  benefits,  especially  simplifying  the  design  process  and  lowering  the  cost  of 
the  overall  realization.  The  development  of  function  decomposition  theory  has  been 
motivated  primarily  by  Switching  (or  Logic)  Circuit  Design,  where  the  lowest  level 
sub-functions  are  realized  by  some  standard  circuit  component  (e.g.  an  AND  gate). 
Pattern Theory holds that function decomposition is of fundamental importance in
the development of all computing systems, including algorithm design and machine
learning. Aside from the practical motivation, function decomposition is a well defined
and very interesting mathematical problem.

This  chapter  has  three  main  sections.  Section  5.2  introduces  and  formally  proves 
the  test  for  decomposability.  The  next  section  describes  the  Ada  program  used  in 
this  project  to  decompose  functions.  The  final  section  reports  on  performance  tests 
of  various  versions  of  the  Ada  program. 




5.2  The  Basic  Decomposition  Condition 

5.2.1  Introduction 

A  number  of  algorithms  have  been  developed  to  find  the  decompositions  of  a  given 
function.  All  these  algorithms  are  iterative.  That  is,  they  decompose  the  function 
in  a  series  of  similar  steps.  The  first  step  decomposes  the  function  into  a  small 
number  of  sub-functions.  The  second  step  decomposes  each  of  these  sub-functions 
into  a  small  number  of  sub-subfunctions.  This  process  is  repeated  until  the  remaining 
functions  are  no  longer  decomposable.  The  function  decomposition  algorithms  use 
a  fairly  standard  test  for  whether  or  not  a  function  (or  sub-function)  decomposes. 
This  test  is  based  on  what  we  call  the  basic  decomposition  condition.  On  the  other 
hand,  the  function  decomposition  algorithms  use  different  methods  for  searching  and 
selecting  among  possible  decompositions.  An  optimal  method  of  search  and  selection 
(other than exhaustive search) has not been defined. The basic decomposition
condition is the common non-heuristic portion of these algorithms. The basic decomposition
condition test is an exponentially (relative to the number of variables in the function)
difficult problem. The number of variable partitions is another exponential factor;
however, this factor can perhaps be mitigated with reasonable search heuristics. The
nature  of  the  decomposition  test  is  such  that  there  is  no  way  to  limit  its  exponential 
complexity.  Therefore,  the  decomposition  test  is  also  important  as  a  limiting  factor 
in  the  practicality  of  any  algorithm  which  exactly  decomposes  total  functions. 

The  purpose  of  this  section  is  to  develop  a  concise,  yet  rigorous  statement  of  a 
general  basic  decomposition  condition.  Many  of  the  previously  published  statements 
of  the  basic  decomposition  condition  have  shortcomings  in  rigor,  generality,  or  in 
requiring  an  extensive  background  in  non-essential  materials.  For  example,  the  bible 
of function decomposition is the text by H. A. Curtis [14]. However, it is not until
page 471 that the most general form of the decomposition condition is given, and an
understanding of much of the previous 470 pages is required to understand page 471.
Reference  [14]  also  does  not  prove  the  most  general  statement  of  the  decomposition 
condition,  rather  it  is  an  extension  of  the  proofs  of  less  general  forms.  The  most 
general  form  given  in  [14]  is  also  not  applicable  to  multi-valued  functions.  Finally, 
[14]  has  been  out  of  print  for  some  time  and  it  is  very  difficult  to  find. 

5.2.2  An Intuitive Introduction to the Decomposition Condition

Consider a function of the form $f : X_1 \times X_2 \times \cdots \times X_n \to Y$. This function is
also denoted $f(x_1, x_2, \ldots, x_n)$, where $x_i$ represents some unspecified value of $X_i$;
$\{x_1, x_2, \ldots, x_n\}$ is called the set of variables of $f$. We are interested in a partition of
the variables of $f$ into two sets. A partition is a collection of subsets whose union is
the whole set and whose intersections are all empty. We denote the two sets of variables
$v_1$ and $v_2$. Therefore, $v_1 \cup v_2 = \{x_1, x_2, \ldots, x_n\}$ and $v_1 \cap v_2 = \emptyset$. For example, if




 x1  x2  x3  x4 | f(x1,x2,x3,x4)
----------------+---------------
  0   0   0   0 |       0
  0   0   0   1 |       1
  0   0   1   0 |       1
  0   0   1   1 |       0
  0   1   0   0 |       1
  0   1   0   1 |       0
  0   1   1   0 |       1
  0   1   1   1 |       1
  1   0   0   0 |       1
  1   0   0   1 |       0
  1   0   1   0 |       0
  1   0   1   1 |       0
  1   1   0   0 |       1
  1   1   0   1 |       0
  1   1   1   0 |       1
  1   1   1   1 |       1

Table  5.1:  A  Table  Representation  of  a  Function


$n = 8$, then one partition is $(v_1 = \{x_1, x_3, x_4, x_6, x_7\},\ v_2 = \{x_2, x_5, x_8\})$ and a second
partition is $(v_1 = \{x_2, x_3, x_4, x_6, x_7\},\ v_2 = \{x_1, x_5, x_8\})$.

A  finite  function  can  be  represented  by  a  “truth-table”  where  all  possible  values 
for  the  input  are  listed  with  their  corresponding  image  under  the  function.  Table  5.1 
is  a  table  defining  a  function. 

We might call Table 5.1 a one-dimensional table since it lists the values in a vertical
line. We can also represent a function with a two-dimensional table by letting the
values of the variables in $v_1$ mark off one direction while the values of the variables in
$v_2$ mark off the orthogonal direction. The value of the function, for the given values
of $v_1$ and $v_2$, then goes into the “matrix” (the 2-D table) at the coordinates $(v_1, v_2)$.
For example, if $v_1 = \{x_1, x_2\}$ and $v_2 = \{x_3, x_4\}$ then the function of Table 5.1 could
also be represented as the 2-D table of Table 5.2.

In Table 5.2, $x_1$ and $x_2$ specify a column of the 2-D table while $x_3$ and $x_4$ specify
a row of the table. Of course, $x_1$, $x_2$, $x_3$, and $x_4$ specify a particular point in the
table corresponding to the value of the function at $x_1$, $x_2$, $x_3$, and $x_4$. A 2-D table
of a function is called a partition matrix. A different partition of the variables would
give a different arrangement of $f$'s values. For example, with respect to the partition
$v_1 = \{x_1, x_4\}$ and $v_2 = \{x_2, x_3\}$ the 2-D table becomes as in Table 5.3.

There  could  also  be  partitions  with  unequal  numbers  of  variables  as  in  Table  5.4. 
If  one  of  the  sets  of  variables  is  empty  we  have  the  familiar  1-D  table  as  in  Table  5.1. 
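
The construction of a partition matrix can be sketched in a few lines. The sketch below is an illustration only, not the report's Ada program: the function `partition_matrix`, the 0-based variable indices, and the hand-transcribed output string for the function of Table 5.1 are assumptions made for this example.

```python
from itertools import product

# Outputs of f from Table 5.1, transcribed by hand in the order
# x1x2x3x4 = 0000, 0001, ..., 1111 (x4 varies fastest).
F_OUTPUTS = "0110101110001011"
f = {bits: int(v) for bits, v in zip(product((0, 1), repeat=4), F_OUTPUTS)}

def partition_matrix(func, n, col_vars, row_vars):
    """Return the 2-D table as a list of rows.

    col_vars and row_vars are 0-based positions of the variables that
    select the column and the row, respectively.
    """
    matrix = []
    for row_vals in product((0, 1), repeat=len(row_vars)):
        row = []
        for col_vals in product((0, 1), repeat=len(col_vars)):
            point = [None] * n
            for i, val in zip(col_vars, col_vals):
                point[i] = val
            for i, val in zip(row_vars, row_vals):
                point[i] = val
            row.append(func[tuple(point)])
        matrix.append(row)
    return matrix

# Columns x1x2, rows x3x4 (the arrangement of Table 5.2).
table_5_2 = partition_matrix(f, 4, col_vars=(0, 1), row_vars=(2, 3))
# Columns x1x4, rows x2x3 (the arrangement of Table 5.3).
table_5_3 = partition_matrix(f, 4, col_vars=(0, 3), row_vars=(1, 2))
for row in table_5_2:
    print(row)
```

The same function values appear in both matrices; only their arrangement changes with the partition.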

Now  examine  Table  5.5.  Notice  that  all  the  columns  are  the  same.  It  does  not 




                 x1x2
    f       00  01  10  11
          +----------------
       00 |  0   1   1   1
 x3x4  01 |  1   0   0   0
       10 |  1   1   0   1
       11 |  0   1   0   1

Table  5.2:  A  2-D  Table  of  a  Function  With  Respect  to  a  Partition  of  its  Variables


                 x1x4
    f       00  01  10  11
          +----------------
       00 |  0   1   1   0
 x2x3  01 |  1   0   0   0
       10 |  1   0   1   0
       11 |  1   1   1   1

Table  5.3:  A  Second  2-D  Table  of  a  Function  With  Respect  to  a  Partition  of  its
Variables


                  x1
     f          0   1
             +--------
         000 |  0   1
         001 |  1   0
         010 |  1   0
 x2x3x4  011 |  0   0
         100 |  1   1
         101 |  0   0
         110 |  1   1
         111 |  1   1

Table  5.4:  A  Table  Representation  of  a  Function




                 x1x2
    f       00  01  10  11
          +----------------
       00 |  1   1   1   1
 x3x4  01 |  0   0   0   0
       10 |  0   0   0   0
       11 |  1   1   1   1

Table  5.5:  A  2-D  Table  of  a  Function  With  Respect  to  a  Partition  of  its  Variables


matter which column is specified by $x_1 x_2$. In other words, the values of $x_1$ and $x_2$ do
not impact the value of $f$. The value of $f$ depends only on the values of $x_3$ and $x_4$ (i.e.
which row of the 2-D table). Therefore, there exists a function $F : \{0,1\}^2 \to \{0,1\}$
such that $F(x_3, x_4) = f(x_1, x_2, x_3, x_4)$ for all $x_1$ in $X_1$, all $x_2$ in $X_2$, all $x_3$ in $X_3$, and
all $x_4$ in $X_4$. In a very reasonable sense then, $F$ is a complete representation of $f$.

Another  way  of  saying  that  all  the  columns  of  /’s  partition  matrix  are  the  same 
is  to  say  that  there  is  only  one  distinct  column  in  /’s  partition  matrix.  The  number 
of  distinct  columns  is  an  important  part  of  the  basic  decomposition  condition  and  is 
called the column multiplicity and denoted $\nu$.

From the above example, we see that if $\nu = 1$ then the variables associated with
the columns of the partition matrix can be dropped. That is, if $\nu = 1$ then there
exists a function ($F$) on the row variables only, which is equal to $f$. We can begin to
see the relationship between $\nu$ and the decomposability of a function.
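
As a small illustration, the column multiplicity of a total function's partition matrix can be counted directly. The helper name `column_multiplicity` and the list-of-rows layout are assumptions of this sketch; the two matrices are transcribed from Table 5.5 and Table 5.2.

```python
def column_multiplicity(matrix):
    """Count the distinct columns of a partition matrix given as a list of rows."""
    columns = set(zip(*matrix))   # zip(*rows) transposes rows into columns
    return len(columns)

table_5_5 = [[1, 1, 1, 1],
             [0, 0, 0, 0],
             [0, 0, 0, 0],
             [1, 1, 1, 1]]
table_5_2 = [[0, 1, 1, 1],
             [1, 0, 0, 0],
             [1, 1, 0, 1],
             [0, 1, 0, 1]]

nu_5_5 = column_multiplicity(table_5_5)   # every column identical
nu_5_2 = column_multiplicity(table_5_2)   # columns 01 and 11 coincide
print(nu_5_5, nu_5_2)
```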

Now consider the function $g$ defined in Table 5.6. With respect to the partition
of variables $v_1 = \{x_1, x_2\}$ and $v_2 = \{x_3, x_4\}$, $g$ has the partition matrix of Table 5.7.

Notice that there are two distinct columns in $g$'s partition matrix. That is, $\nu = 2$.
In this case it is not possible to drop the $v_1$ (i.e. $x_1$ and $x_2$) variables. Without the
$v_1$ variables there is an ambiguity in the value of the function when $(x_3, x_4) = (0,1)$ or
$(1,1)$. Even though we cannot drop the $v_1$ variables, the $v_1$ variables are really only
needed to distinguish between the two distinct columns. Therefore, we can define a
function $\phi : X_1 \times X_2 \to Z$ which selects the appropriate column when given $x_1$ and
$x_2$. For example, $z = 0$ indicates the first column and $z = 1$ indicates the second
column. There also exists a function $G$ which takes $z$ (to select between the two
distinct columns), $x_3$ and $x_4$ as input, and represents $g$; that is, $g(x_1, x_2, x_3, x_4) =
G(\phi(x_1, x_2), x_3, x_4)$. See Table 5.8 and Table 5.9, where $g$ is defined using $G$ and $\phi$.

From the preceding example, we see that if $\nu = 2$ for a function $g$ with respect
to the partition $(v_1, v_2)$ then there exist functions $G$ and $\phi$ such that $g(v_1, v_2) =
G(\phi(v_1), v_2)$.

Summarizing the prior two examples, if $\nu = 1$, then the $v_1$ variables can be dropped
(or reduced to a “unary” variable) and if $\nu = 2$ then the $v_1$ variables can be reduced
to a binary variable. A trend in the relationship between $\nu$ and decomposability is
beginning to develop.
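
The step from a partition matrix to the pair of component functions can be sketched as follows. This is a minimal illustration under stated assumptions: the name `decompose`, the list-of-rows matrix layout, and the numbering of distinct columns by order of first appearance are choices made here, not taken from the report. The matrix is the partition matrix of $g$ with columns $x_1x_2$ and rows $x_3x_4$.

```python
def decompose(matrix):
    """Split a partition matrix (list of rows) into a column selector and a table.

    phi[c] is the index of the distinct column found at column position c;
    G[z] is the z-th distinct column, so matrix[r][c] == G[phi[c]][r].
    """
    distinct = []          # distinct columns, in order of first appearance
    phi = []
    for column in zip(*matrix):
        if column not in distinct:
            distinct.append(column)
        phi.append(distinct.index(column))
    return phi, distinct

g_matrix = [[1, 1, 1, 1],    # x3x4 = 00
            [0, 1, 1, 0],    # x3x4 = 01
            [0, 0, 0, 0],    # x3x4 = 10
            [1, 0, 0, 1]]    # x3x4 = 11

phi, G = decompose(g_matrix)
print(phi)   # one selector value per column x1x2 = 00, 01, 10, 11
print(G)     # one column of G per selector value
```

The number of entries in `G` is the column multiplicity, so a selector with that many values is always enough to reproduce the matrix.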




 x1  x2  x3  x4 | g(x1,x2,x3,x4)
----------------+---------------
  0   0   0   0 |       1
  0   0   0   1 |       0
  0   0   1   0 |       0
  0   0   1   1 |       1
  0   1   0   0 |       1
  0   1   0   1 |       1
  0   1   1   0 |       0
  0   1   1   1 |       0
  1   0   0   0 |       1
  1   0   0   1 |       1
  1   0   1   0 |       0
  1   0   1   1 |       0
  1   1   0   0 |       1
  1   1   0   1 |       0
  1   1   1   0 |       0
  1   1   1   1 |       1

Table  5.6:  A  Table  Representation  of  Function  g


                 x1x2
    g       00  01  10  11
          +----------------
       00 |  1   1   1   1
 x3x4  01 |  0   1   1   0
       10 |  0   0   0   0
       11 |  1   0   0   1

Table  5.7:  A  Partition  Matrix  (2-D  Table)  of  g


                 x1x2
    g       00  01  10  11
   phi       0   1   1   0
          +----------------
       00 |  1   1   1   1
 x3x4  01 |  0   1   1   0
       10 |  0   0   0   0
       11 |  1   0   0   1

Table  5.8:  A  Partition  Matrix  of  g  With  $\phi$  Defined


                phi
    G         0   1
          +---------
       00 |   1   1
 x3x4  01 |   0   1
       10 |   0   0
       11 |   1   0

Table  5.9:  g  Defined  by  G  and  $\phi$


Consider one final example, which we only outline. Suppose $h : X_1 \times X_2 \times X_3 \times
X_4 \to Y$ has $\nu = 3$ for the partition of variables $v_1 = \{x_1, x_2\}$ and $v_2 = \{x_3, x_4\}$.
Then, as in the second example, we can define $\eta : X_1 \times X_2 \to Z$ and $H : Z \times X_3 \times X_4 \to
Y$ such that $h(x_1, x_2, x_3, x_4) = H(\eta(x_1, x_2), x_3, x_4)$, except now we must have three
elements in $Z$ (e.g. $Z = \{0,1,2\}$) to distinguish the three distinct columns. Clearly,
for any function $f : X_1 \times X_2 \times \cdots \times X_n \to Y$, as long as $Z$ has at least $\nu$
elements there will always exist functions $\phi : V_1 \to Z$ and $F : Z \times V_2 \to Y$ such
that $f(x_1, x_2, \ldots, x_n) = F(\phi(v_1), v_2)$. For example, if $\nu = 4$ then $Z = \{0,1,2,3\}$ is
sufficient for a decomposition of the form of the prior examples. This is, essentially,
the decomposition condition. Once $\nu$ is determined for a partition of a function's
variables, we know how big $Z$ must be for the function to decompose with respect to
that partition. In particular, for any function $f : X_1 \times X_2 \times \cdots \times X_n \to Y$, if $\nu \le [Z]$
with respect to variable partition $v_1, v_2$, then there exist functions $\phi : V_1 \to Z$ and
$F : Z \times V_2 \to Y$ such that $f(x_1, x_2, \ldots, x_n) = F(\phi(v_1), v_2)$.

In the above examples we assumed that $Z$ is of the form $\{1, 2, 3, \ldots, [Z]\}$. We
could have used a set of vectors for $Z$, e.g. $Z = \{0,1\}^2 = \{(0,0), (0,1), (1,0), (1,1)\}$.
When $Z$ is a vector set, we can think of $\phi$ as a single vector valued function, or as
a vector of scalar valued functions. For example, if $V_1 = \{0,1\}^3$, $V_2$ is $\{0,1\}$, and
$\nu = 4$, then we could define any of the following decompositions: $f(x_1, x_2, x_3, x_4) =
F(\phi(x_1, x_2, x_3), x_4) = F'(\phi'(x_1, x_2, x_3), x_4) = F''(\phi''(x_1, x_2, x_3), \phi'''(x_1, x_2, x_3), x_4)$, where
$\phi : V_1 \to \{0,1,2,3\}$, $\phi' : V_1 \to \{0,1\}^2$, $\phi'' : V_1 \to \{0,1\}$, and $\phi''' : V_1 \to \{0,1\}$.
Table 5.10 shows examples for the various $\phi$'s. Therefore, a basic decomposition does
not necessarily have exactly two component functions. In particular, there can be
several $\phi$'s.
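
The relationship among the scalar, vector, and component forms of $\phi$ can be sketched as a binary encoding. In the sketch below the scalar values are those listed for $\phi$ in Table 5.10; the helper `to_bits` and the names `phi_vec`, `phi_hi`, and `phi_lo` are hypothetical, and the most-significant-bit-first ordering is an assumption chosen to match the pairs shown in that table.

```python
# Scalar phi on (x1, x2, x3), as listed in Table 5.10.
phi = {(0, 0, 0): 0, (0, 0, 1): 2, (0, 1, 0): 1, (0, 1, 1): 1,
       (1, 0, 0): 0, (1, 0, 1): 3, (1, 1, 0): 2, (1, 1, 1): 0}

def to_bits(value, width):
    """Binary digits of `value`, most significant first."""
    return tuple((value >> shift) & 1 for shift in reversed(range(width)))

# One vector-valued function into {0,1}^2 ...
phi_vec = {x: to_bits(z, 2) for x, z in phi.items()}
# ... or, equivalently, two scalar functions into {0,1}.
phi_hi = {x: bits[0] for x, bits in phi_vec.items()}
phi_lo = {x: bits[1] for x, bits in phi_vec.items()}

for x in sorted(phi):
    print(x, phi[x], phi_vec[x], phi_hi[x], phi_lo[x])
```

All three forms carry the same information about which distinct column is selected; they differ only in how the selector is packaged for the outer function.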

When $Z$ is a vector set, its cardinality is the product of the cardinalities of the
sets in the product of sets forming $Z$. That is, if $Z = X_1 \times X_2 \times X_3 \times \cdots \times X_k$
then $[Z] = [X_1][X_2][X_3] \cdots [X_k]$. For cost considerations (discussed later) we are
interested in keeping $[Z]$ as small as possible. We also want to maximize $k$ since this
will provide more opportunities for decomposing $F$. The ideal situation is for $\nu$ to
be some power of 2; in this case $Z = \{0,1\}^k$, where $k = \log_2(\nu)$, meets both of our
objectives. However, if we insist on using a product of some set $X$ then when $\nu$ is
just slightly larger than a power of $[X]$, $[Z]$ is almost twice as large as really required.
Theorem 5.1 provides one possible point in the $[Z]$ and $k$ trade-off.




 x1  x2  x3 | phi   phi'    phi''  phi'''
------------+-----------------------------
  0   0   0 |  0    (0,0)     0      0
  0   0   1 |  2    (1,0)     1      0
  0   1   0 |  1    (0,1)     0      1
  0   1   1 |  1    (0,1)     0      1
  1   0   0 |  0    (0,0)     0      0
  1   0   1 |  3    (1,1)     1      1
  1   1   0 |  2    (1,0)     1      0
  1   1   1 |  0    (0,0)     0      0

Table  5.10:  Various  Forms  of  Z


[Block diagram: inputs $v_1$ and $v_2$; output $f(v_1, v_2)$.]

Figure  5.1:  Form  of  a  Decomposition


Theorem 5.1  For minimum $[Z]$ (i.e. $[Z] = \nu$), the number of variables in $Z =
X_1 \times X_2 \times X_3 \times \cdots \times X_k$ is maximized if $[X_1], [X_2], [X_3], \ldots, [X_k]$ are the prime
factors of $\nu$.
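
As a worked illustration of Theorem 5.1, the sketch below factors a column multiplicity into primes; each prime factor would be the cardinality of one component set of $Z$. The function name `prime_factors` is an assumption of this example, and simple trial division is used only because the multiplicities involved are small.

```python
def prime_factors(nu):
    """Prime factors of nu with multiplicity, in non-decreasing order."""
    factors = []
    divisor = 2
    while divisor * divisor <= nu:
        while nu % divisor == 0:
            factors.append(divisor)
            nu //= divisor
        divisor += 1
    if nu > 1:
        factors.append(nu)     # whatever remains is itself prime
    return factors

for nu in (4, 6, 7, 12):
    sizes = prime_factors(nu)
    print(nu, sizes, len(sizes))   # component cardinalities and their count k
```

For a multiplicity of 12, for instance, three components of cardinalities 2, 2, and 3 keep the product at 12, whereas any factorization with a composite component would use fewer variables.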


There  is  one  final  twist  to  the  decomposition  problem  that  we  want  to  discuss  at 
an  intuitive  level.  In  the  decomposition  condition  just  discussed,  we  were  considering 
decompositions  of  the  form  of  Figure  5.1. 

A generalization of this form is to allow some of the variables which are inputs to
$\phi$ to also be inputs to $F$; that is, to allow decompositions of the form of Figure 5.2.

In this more general case the partition matrix is three-dimensional. $V_1$ defines
columns as before. $V_3$ defines rows as before. $V_2$ defines the third dimension. That
is, for every value of $V_2$, there is a traditional two-dimensional partition matrix. The
information of $V_2$ is available to $F$; differences in the values of the function across


[Block diagram: inputs $V_1$, $V_2$, and $V_3$; block $F$; output $f(V_1, V_2, V_3)$.]

Figure  5.2:  Form  of  a  More  General  Decomposition




                       x1x2
    x3    x4x5    00  01  10  11
                +----------------
           00   |  0   0   1   1
  x3 = 0   01   |  0   0   1   1
           10   |  0   0   1   1
           11   |  0   0   1   1

           00   |  0   1   0   1
  x3 = 1   01   |  0   1   0   1
           10   |  0   1   0   1
           11   |  0   1   0   1

Table  5.11:  Partition  Matrices


this third dimension do not have to be reflected in $\phi$; that is, they can be accounted
for in $F$. Therefore, in defining $\phi$, it is only necessary that each individual layer (for
each value of $v_2$) be decomposable. Let $\nu_{v_2}$ represent the column multiplicity of the
2-D matrix for $v_2$. The decomposition condition therefore is $\nu_{v_2} \le [Z]$ for all $v_2$ in
$V_2$. If we redefine $\nu$ to be the maximum of all $\nu_{v_2}$ for any $v_2$ in $V_2$ then the condition
can be given as $\nu \le [Z]$. The $V_2$ variables are “shared” between $\phi$ and $F$; thus we
sometimes call this generalization a “shared variable” decomposition.

As an example of shared variable decomposition consider the function
defined in the partition matrices of Table 5.11. The partition corresponding to
Table 5.11 is $V_1 = \{x_1, x_2\}$, $V_2 = \{x_3\}$, and $V_3 = \{x_4, x_5\}$.

From the partition matrices we see that $\nu_{x_3=0} = 2$ and $\nu_{x_3=1} = 2$, thus $\nu = 2$. A
decomposition exists with respect to this partition when $Z = \{0,1\}$, e.g. Figure 5.3.
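
The layer-by-layer test can be sketched as below. The function `f` is an assumption: it is one reading of Table 5.11 in which the output equals $x_1$ when $x_3 = 0$ and $x_2$ when $x_3 = 1$, which reproduces both layer multiplicities stated above. The helper names are hypothetical.

```python
from itertools import product

def f(x1, x2, x3, x4, x5):
    # Assumed reading of Table 5.11: x1 when x3 = 0, x2 when x3 = 1.
    return x1 if x3 == 0 else x2

def layer_multiplicity(func, x3):
    """Column multiplicity of the 2-D layer for a fixed value of the shared x3."""
    columns = set()
    for x1, x2 in product((0, 1), repeat=2):              # V1 selects the column
        column = tuple(func(x1, x2, x3, x4, x5)
                       for x4, x5 in product((0, 1), repeat=2))  # V3 selects the row
        columns.add(column)
    return len(columns)

def unshared_multiplicity(func):
    """Column multiplicity when x3 is moved into the row variables instead."""
    columns = set()
    for x1, x2 in product((0, 1), repeat=2):
        column = tuple(func(x1, x2, x3, x4, x5)
                       for x3, x4, x5 in product((0, 1), repeat=3))
        columns.add(column)
    return len(columns)

nu_by_layer = {x3: layer_multiplicity(f, x3) for x3 in (0, 1)}
nu_shared = max(nu_by_layer.values())
nu_unshared = unshared_multiplicity(f)
print(nu_by_layer, nu_shared, nu_unshared)
```

Under this reading, sharing $x_3$ lets a binary $Z$ suffice, while the same column variables without sharing would need four selector values.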


5.2.3  The  Formal  Basic  Decomposition  Condition 

Introduction

When  discussing  decompositions  of  a  function  it  is  natural  to  classify  different  kinds 
of decompositions according to their properties. An extensive taxonomy of
decompositions is developed in [14] (e.g. there are simple, multiple, iterative, disjunctive,
and  proper  types  of  decompositions  as  well  as  their  complements  and  many  of  their 
combinations).  Without  detracting  in  any  way  from  the  importance  of  these  classes, 
we  propose  a  new  class  called  the  basic  decompositions.  This  class  is  defined  to 
correspond  exactly  to  the  class  of  decompositions  which  can  be  tested  for  with  the 
decomposition  condition  to  be  developed  in  this  report. 

We  now  formally  define  a  basic  decomposition.  Let  n  be  a  finite  integer  and 
let $X_1, X_2, \ldots, X_n$, and $Y$ be finite sets, each of cardinality two or more. We are




 x1x2 | phi_{x3=0}  phi_{x3=1}
------+------------------------
  00  |     0           0
  01  |     0           1
  10  |     1           0
  11  |     1           1

 x1x2x3 | phi
--------+-----
  000   |  0
  001   |  0
  010   |  0
  011   |  1
  100   |  1
  101   |  0
  110   |  1
  111   |  1

Figure  5.3:  Example  Decomposition


interested in any partial function $f$ of the form $f : X_1 \times X_2 \times \cdots \times X_n \to Y$. Let
$x_1, x_2, \ldots, x_n$ represent variables of the sets $X_1, X_2, \ldots, X_n$, respectively. That is, $x_i$ is
some unspecified element of $X_i$. Let $V_1$, $V_2$ and $V_3$ be products of finite sets, say
$V_1 = V_{11} \times V_{12} \times \cdots \times V_{1n'}$, $V_2 = V_{21} \times V_{22} \times \cdots \times V_{2n''}$, and
$V_3 = V_{31} \times V_{32} \times \cdots \times V_{3n'''}$, where there exists a bijection
$b : \{1, 2, \ldots, n\} \to \{11, 12, \ldots, 1n', 21, 22, \ldots, 2n'', 31, 32, \ldots, 3n'''\}$
such that $X_i = V_{b(i)}$ for $i = 1, 2, \ldots, n$. $v_1, v_2, v_3$ are variables associated with $V_1$, $V_2$,
and $V_3$, respectively. The variable $v_i$ represents the variables $v_{i1}, v_{i2}, \ldots, v_{in'}$. $V_1$, $V_2$
and $V_3$ are a partition of the domain of $f$. By a “partition” we mean that except for
possibly the order of the sets in the products, $X_1 \times X_2 \times \cdots \times X_n = V_1 \times V_2 \times V_3$.
Consider $f' : V_1 \times V_2 \times V_3 \to Y$ such that $f(x_1, x_2, \ldots, x_n) = f'(v_1, v_2, v_3)$ when the
$v_i$'s and the $x_i$'s are related by $b$. Rather than distinguishing $f$ and $f'$, we simply
use $f$ for either function; which function is made clear by the context. That is, we
say $f(x_1, x_2, \ldots, x_n)$ and $f(v_1, v_2, v_3)$, when $v_1, v_2, v_3$ is a partitioning of the variables
$x_1, x_2, \ldots, x_n$. We similarly use $f(v_{11}, v_{12}, \ldots, v_{1n'}, v_{21}, v_{22}, \ldots, v_{2n''}, v_{31}, v_{32}, \ldots, v_{3n'''})$
with the obvious meaning.

A basic decomposition of a function $f : X_1 \times X_2 \times \cdots \times X_n \to Y$ with respect to
the partition $V_1$, $V_2$ and $V_3$ is two functions $\phi : V_1 \times V_2 \to Z$ and $F : Z \times V_2 \times V_3 \to Y$
such that $f(v_1, v_2, v_3) = F(\phi(v_1, v_2), v_2, v_3)$ for all $v_1 \in V_1$, $v_2 \in V_2$, and $v_3 \in V_3$ when
$f(v_1, v_2, v_3)$ is defined.

When $f$ is not a total function, we only require that $f(v_1, v_2, v_3) = F(\phi(v_1, v_2), v_2, v_3)$
when $f$ is defined. That is, $F(\phi(v_1, v_2), v_2, v_3)$ may be defined arbitrarily or undefined
whenever $f(v_1, v_2, v_3)$ is undefined. Our justification for this comes from two sources.
First, when partial functions arise in practice, the elements with undefined images
either cannot occur or when they do occur we do not care what the function outputs for
that input. Therefore, allowing a decomposition to produce a value when the original
function was undefined is often acceptable in practice. Secondly, in those cases where
we do want the decomposition to be undefined whenever the original function was
undefined, we can modify the semantics of the problem slightly and make it a special
case of what we allow. To modify the semantics of the problem, we define a total
function $f' : X_1 \times X_2 \times \cdots \times X_n \to Y \cup \{u\}$ such that $f'(x_1, x_2, \ldots, x_n) = f(x_1, x_2, \ldots, x_n)$
when $f$ is defined and $f'(x_1, x_2, \ldots, x_n) = u$ when $f$ is undefined. The decomposition
properties of $f'$ with respect to our criteria are the same as the decomposition
properties of $f$ when undefined values must be preserved.
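
The change of semantics described above amounts to adding one extra output symbol. A minimal sketch, assuming a partial function stored as a dictionary that simply omits undefined inputs; the sentinel `UNDEFINED` and the function `totalize` are hypothetical names introduced for this illustration.

```python
from itertools import product

UNDEFINED = "u"   # hypothetical extra element adjoined to the codomain Y

def totalize(partial, domain):
    """Extend a partial function (dict with missing keys) to a total one."""
    return {x: partial.get(x, UNDEFINED) for x in domain}

# A partial Boolean function of two variables, undefined at (1, 0).
partial_f = {(0, 0): 0, (0, 1): 1, (1, 1): 1}
domain = list(product((0, 1), repeat=2))
total_f = totalize(partial_f, domain)
print(total_f)
```

Because the sentinel is treated as an ordinary value, any decomposition of the totalized function reproduces the undefined points exactly, at the price of a codomain with one more element.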

Some of the more familiar types of decompositions are special cases of basic
decompositions. For example, when $X_i = \{0,1\}$ for $i = 1, 2, \ldots, n$ and $Y = \{0,1\}^m$,
we have the classical Boolean functions. When $V_2$ is empty and $Z = \{0,1\}$ we have
a traditional “nondisjunctive” decomposition. If $Z = \{0,1\}^k$, for some integer $k$, we
might think of $\phi$ as a vector function (or $k$ distinct functions on the same domain). In
this case we have an “improper” decomposition. A basic decomposition can be any
of the Curtis types of decomposition. Multiple-valued logic functions are obviously
covered by this form as well.

The  Basic  Decomposition  Condition  Theorem 

Before stating the basic decomposition condition, we need to develop a formal
definition of the column multiplicity ($\nu$) of a partition matrix given a function $f$ and
a partition of its variables $V_1, V_2, V_3$. We assume that $V_1$ and $V_3$ are both
non-empty. When either is empty, $F$ or $\phi$ is exactly $f$; therefore, no real
“decomposition” is involved. Since $V_3$ is a finite set of $[V_3]$ elements, we can define a bijection
$b : \{1, 2, \ldots, [V_3]\} \to V_3$. A “column” for some fixed $v_1$ and $v_2$ is defined as the
sequence: $C_{v_1 v_2} = (f(v_1, v_2, b(1)), f(v_1, v_2, b(2)), f(v_1, v_2, b(3)), \ldots, f(v_1, v_2, b([V_3])))$.
Therefore, $C_{v_1 v_2}$ is a vector from $Y^{[V_3]}$. Columns form a “set of columns” ($S_{v_2}$) for
a fixed $v_2$: $S_{v_2} = \{C_{v_1 v_2} \mid v_1 \in V_1\}$. When $f$ is total, $\nu$ is the maximum element of
$\{[S_{v_2}] \mid v_2 \in V_2\}$; however, when $f$ is not total, we need an extra step.

Call two columns compatible if the only coordinates in which they differ are those
where either is undefined. Consider $\hat{S}_{v_2} = \{C_{v_1 v_2} \mid f(v_1, v_2, v_3)$ is defined for each
$v_3\}$. If $\hat{S}_{v_2}$ is empty, we can go to later steps. If not, define relation $R_{v_2}$ on $\hat{S}_{v_2}$
using prefix notation by $R_{v_2}(C_{v_1 v_2}, C_{v_1' v_2}) \iff f(v_1, v_2, v_3) = f(v_1', v_2, v_3)$ for all $v_3 \in V_3$.
This is an equivalence relation on $\hat{S}_{v_2}$. Enumerate the resulting equivalence classes
$E_1, E_2, \ldots, E_{l_{v_2}}$, calling representatives $e_1, e_2, \ldots, e_{l_{v_2}}$. Also enumerate the elements
of the set $S_{v_2} \setminus \hat{S}_{v_2}$: $C_1, C_2, \ldots$. Choose the first class such that $C_1$ is compatible
with the representative of that class and adjoin $C_1$ to that class. If no such class
is found, $C_1$ will belong to its own class $E_{l_{v_2}+1}$. If $C_1$ creates a new class, it is the
representative, called $e^1_{l_{v_2}+1}$. Otherwise, define the new representative of the class
that $C_1$ is in to be $e^1_k$, where $(e^1_k)_m$ is the value of $(C_1)_m$ or $(e_k)_m$ if either or both
are defined (if both, they must be equal), and $(e^1_k)_m$ is undefined if both $(C_1)_m$ and
$(e_k)_m$ are undefined. Here, $m = 1$ to $[V_3]$ and $m$ stands for the “coordinate” of the
column vectors. Representatives of other classes remain the same but are denoted
individually as $e^1_j$. Continue in this manner with the other elements of $S_{v_2} \setminus \hat{S}_{v_2}$. That
is, if a column $C_a$ is not compatible with any of the existing representatives, it will
create a new class whose number is one more than the number of the last class:
$l_{v_2} + 1$. It will be the representative, denoted $e^a_{l_{v_2}+1}$. If a column $C_a$ is compatible
with an existing representative, we choose the first occurrence of this, adjoin $C_a$ to
that class and proceed as with $C_1$, except that in the above steps (when we dealt with
$C_1$) we replace $e_k$ with $e^{a-1}_k$, $C_1$ with $C_a$, and $e^1_k$ with $e^a_k$. Finally, given $v_2$, we have the
set of classes $E_{v_2} = \{E^{v_2}_i \mid 1 \le i \le l_{v_2}\}$ which partitions the set of columns. The
classes' representatives are $e^{v_2}_1, e^{v_2}_2, \ldots, e^{v_2}_{l_{v_2}}$. Call the equivalence relation
determined by this partition $R_{v_2}$. Now we define $\nu_{v_2} = [E_{v_2}]$. When $V_2$ is empty,
there is only one $\nu_{v_2}$ since $v_2$ cannot occur in the expression. Finally, we define $\nu$ as
the maximum over all $v_2$ in $V_2$ of $\nu_{v_2}$. This definition relies only on elementary set
theory for background.
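
The class construction for partial functions can be sketched as a greedy procedure. This is an illustration under stated assumptions rather than the report's Ada implementation: columns are tuples, `None` marks an undefined coordinate, and the names `compatible`, `merge`, and `column_classes` are hypothetical.

```python
def compatible(a, b):
    """Two columns are compatible if they agree wherever both are defined."""
    return all(x is None or y is None or x == y for x, y in zip(a, b))

def merge(rep, col):
    """New representative: defined wherever either the old one or `col` is."""
    return tuple(y if x is None else x for x, y in zip(rep, col))

def column_classes(columns):
    """Group columns into classes, fully defined columns first.

    Fully defined columns fall into classes by equality; each remaining
    column is then adjoined to the first class whose representative it is
    compatible with, or starts a new class. Returns (labels, representatives).
    """
    reps = []
    labels = [None] * len(columns)
    total = [i for i, c in enumerate(columns) if None not in c]
    partial = [i for i, c in enumerate(columns) if None in c]
    for i in total + partial:
        col = columns[i]
        for k, rep in enumerate(reps):
            if compatible(rep, col):
                reps[k] = merge(rep, col)
                labels[i] = k
                break
        else:
            reps.append(col)              # no compatible class: start a new one
            labels[i] = len(reps) - 1
    return labels, reps

columns = [(0, 1, None), (0, 1, 1), (None, 1, 0), (1, 0, 0)]
labels, reps = column_classes(columns)
nu_layer = len(reps)    # column multiplicity of this layer
print(labels, reps, nu_layer)
```

The resulting class count depends on the order in which the partially defined columns are processed, which is why the text fixes an enumeration before adjoining them.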


Theorem 5.2 (The Basic Decomposition Condition)$^1$  For finite integer $n$, and
finite sets $X_1, X_2, \ldots, X_n$, and $Y$, let $f$ be a partial function $f : X_1 \times X_2 \times \cdots \times X_n \to
Y$. Let $V_1$, $V_2$, and $V_3$ be a partition of the domain of $f$. There exist functions
$\phi : V_1 \times V_2 \to Z$ and $F : Z \times V_2 \times V_3 \to Y$ such that, whenever $f(v_1, v_2, v_3)$ is defined,
$f(v_1, v_2, v_3) = F(\phi(v_1, v_2), v_2, v_3)$ if and only if $\nu \le [Z]$.

Proof: 

First we prove that $\nu \le [Z]$ implies that there exist functions $\phi : V_1 \times V_2 \to Z$ and
$F : Z \times V_2 \times V_3 \to Y$ such that, whenever $f(v_1, v_2, v_3)$ is defined, $f(v_1, v_2, v_3) =
F(\phi(v_1, v_2), v_2, v_3)$.

Step 1. Define $\phi : V_1 \times V_2 \to Z$ by $\phi(v_1, v_2) = i$ if $C_{v_1 v_2} \in E^{v_2}_i$.
Step 2. Define $F : Z \times V_2 \times V_3 \to Y$ as follows:

i. if $z = i$ for some $i$, then $F(i, v_2, v_3) = (e^{v_2}_i)_{b(v_3)}$, where $b$ is the bijection from $V_3$
into $\{1, 2, \ldots, [V_3]\}$,

ii. otherwise, $F(z, v_2, v_3)$ is undefined. Once in place, defined coordinates of the
representatives of the equivalence classes do not change, so when defined $f(v_1, v_2, v_3) =
F(\phi(v_1, v_2), v_2, v_3)$.
Now we prove that the existence of functions $\phi : V_1 \times V_2 \to Z$ and $F : Z \times V_2 \times V_3 \to
Y$ such that, whenever $f(v_1, v_2, v_3)$ is defined, $f(v_1, v_2, v_3) = F(\phi(v_1, v_2), v_2, v_3)$ implies
that $\nu \le [Z]$. First observe that $\nu \le [Z]$ is logically equivalent to $\nu_{v_2} \le [Z]$ for all
$v_2 \in V_2$. Assume to the contrary that there exists $v_2 \in V_2$ such that $\nu_{v_2} > [Z]$.

1) We can assume that equivalent columns correspond to ordered pairs which
lie in the same inverse image under $\phi$. When $C_{v_1 v_2}$ and $C_{v_1' v_2}$ are $R_{v_2}$-related, then
when defined $f(v_1, v_2, v_3) = f(v_1', v_2, v_3)$ for each $v_3$. Thus, $F(\phi(v_1, v_2), v_2, v_3) =
F(\phi(v_1', v_2), v_2, v_3)$. No harm is done to the relationship between $\phi$ and $F$ and the
range of $\phi$ is no larger than before. Hence, we assume that the inverse image of an
element of $Z$ contains ordered pairs which correspond to an equivalence class (under
$R_{v_2}$) of columns.

2) If $\nu_{v_2} > [Z]$, then there must be two non-equivalent columns whose
corresponding ordered pairs have the same image under $\phi$. These columns cannot come
from $\hat{S}_{v_2}$, since $R_{v_2}$ is the equality relation there: if two columns from $\hat{S}_{v_2}$ are
not equivalent, there must exist a $v_3$ such that $f(v_1, v_2, v_3) \ne f(v_1', v_2, v_3)$. Hence,
$f(v_1, v_2, v_3) = F(\phi(v_1, v_2), v_2, v_3) = F(\phi(v_1', v_2), v_2, v_3) = f(v_1', v_2, v_3) \ne f(v_1, v_2, v_3)$,
a contradiction.

$^1$ Mike Breen contributed substantially to the development of this proof. See [4] for the original
development of a decomposition condition.

3) For $S_{v_2} \setminus \hat{S}_{v_2}$, use the ordering developed in defining $R_{v_2}$. Choose the first
element of $S_{v_2} \setminus \hat{S}_{v_2}$, $C_j$, such that the number of equivalence classes in $\hat{S}_{v_2} \cup \{C_1, \ldots, C_j\}$
is larger than $[\phi(\hat{S}_{v_2} \cup \{C_1, \ldots, C_j\})]$. If no such element exists, $\nu_{v_2} \le [Z]$. Because of
how $C_j$ was chosen, $C_j$ must create its own class. That is, $C_j$ is not compatible with
any representative of the equivalence classes at that time. We can write $C_{v_1 v_2}$ for $C_j$
and see that there exists $C_{v_1' v_2} \in \hat{S}_{v_2} \cup \{C_1, \ldots, C_j\}$ such that $\phi(v_1, v_2) = \phi(v_1', v_2)$.
Call the representative of the class of $C_{v_1' v_2}$ $e^a$. We know that $C_{v_1 v_2}$ is not compatible
with $e^a$. Therefore, there is a $C_{v_1'' v_2}$ currently in the class with $C_{v_1' v_2}$ such that for
some $v_3$, $f(v_1'', v_2, v_3)$ and $f(v_1, v_2, v_3)$ are defined and unequal. This last statement
follows from the fact that a column is compatible with a representative of a class if and
only if it is compatible with each element in the class. Now we have $R_{v_2}(C_{v_1'' v_2}, C_{v_1' v_2})$,
which implies that $\phi(v_1'', v_2) = \phi(v_1', v_2) = \phi(v_1, v_2)$ and that $F(\phi(v_1'', v_2), v_2, v_3) =
F(\phi(v_1, v_2), v_2, v_3)$ for each $v_3$. Yet for some $v_3$, $f(v_1'', v_2, v_3) \ne f(v_1, v_2, v_3)$ and both
are defined. As before, this contradicts the assumption about $\phi$ and $F$. Hence, no
such non-equivalent columns exist.

□ 


5.2.4  Non- Trivial  Basic  Decompositions 

For a mapping of a given form $f : X_1 \times X_2 \times \cdots \times X_n \to Y$ and for a given partition $V$,
if  [Z]  is  sufficiently  large  then  every  possible  function  of  that  form  will  decompose  with 
respect  to  that  partition.  We  call  decompositions  of  this  type  trivial.  Decompositions 
which  are  not  trivial  are  called  non-trivial.  Since  non-trivial  decompositions  do  not 
always  exist,  they  are  in  some  sense  special.  This  section  develops  the  condition  for 
the  existence  of  non-trivial  basic  decompositions. 

First we establish the least upper bound on $\nu$. Define $\nu_{\max}$ to be the smaller of
$[V_1]$ or $[Y]^{[V_3]}$.

Theorem 5.3  If $V = \{V_1, V_2, V_3\}$ is a partition of the variables of the function $f :
X_1 \times X_2 \times \cdots \times X_n \to Y$ then $\nu$ of $f$ with respect to that partition is less than or
equal to $\nu_{\max}$. Further, there is no other bound less than $\nu_{\max}$.

Proof: 

First we show that $\nu \le [V_1]$. $[V_1]$ is the total number of columns in the partition
matrix, therefore the number of distinct columns cannot exceed this number. More
rigorously, $[S_{v_2}] = [\{C_{v_1 v_2} \mid v_1 \in V_1\}] \le [V_1]$ for all $v_2$, since there must be an element in
the right hand set for every element in the left hand set. By definition of $E_{v_2}$ then,
$[E_{v_2}] \le [V_1]$ for all $v_2$ and by definition of $\nu_{v_2}$, $\nu_{v_2} \le [V_1]$ for all $v_2$. Finally, by
definition of $\nu$, $\nu \le [V_1]$.

Now we show that $\nu \le [Y]^{[V_3]}$. $[V_3]$ is the number of rows in the partition matrix,
and the partition matrix contains elements from $Y$. Therefore, each column is a string
of characters from $Y$ of length $[V_3]$ and there can only be $[Y]^{[V_3]}$ such strings. More
rigorously, recall that $C_{v_1 v_2} = (f(v_1, v_2, b(1)), f(v_1, v_2, b(2)), \ldots, f(v_1, v_2, b([V_3])))$. Let
$A$ be the set of all possible $C_{v_1 v_2}$'s. $C_{v_1 v_2}$ is a string on $Y$ of length $[V_3]$, therefore,
$[A] = [Y]^{[V_3]}$. By definition of $S_{v_2}$, $S_{v_2}$ is a subset of $A$, thus $[S_{v_2}] \le [A] = [Y]^{[V_3]}$.
$E_{v_2}$ is a partition of $S_{v_2}$ and does not contain the empty set, so $[E_{v_2}] \le [S_{v_2}]$. By
definition of $\nu_{v_2}$, $\nu_{v_2} = [E_{v_2}] \le [S_{v_2}] \le [Y]^{[V_3]}$. This is true for arbitrary $\nu_{v_2}$, therefore
$\nu \le [Y]^{[V_3]}$.

We have shown that $\nu \le [V_1]$ and that $\nu \le [Y]^{[V_3]}$, therefore $\nu$ is less than or equal
to the smaller of these two limits. That is, $\nu \le \nu_{\max}$.

There is no bound less than $\nu_{\max}$ since we can always construct a function of the
given form with respect to the given partition such that the partition matrix contains
distinct columns up to the $\nu_{\max}$ limit.
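
The bound and the construction that attains it can be checked numerically. The sketch below assumes a total function and a single layer (no shared variables); the names `nu_max` and `extremal_matrix` are hypothetical, and the extremal matrix simply cycles through all possible columns.

```python
from itertools import product

def nu_max(size_v1, size_y, size_v3):
    """Smaller of the number of columns and the number of possible columns."""
    return min(size_v1, size_y ** size_v3)

def extremal_matrix(size_v1, y_values, size_v3):
    """Partition matrix (list of rows) with as many distinct columns as possible."""
    strings = list(product(y_values, repeat=size_v3))    # every possible column
    columns = [strings[c % len(strings)] for c in range(size_v1)]
    return [list(row) for row in zip(*columns)]          # transpose to rows

# Eight columns but only two rows over Y = {0, 1}: at most 2**2 distinct columns.
wide = extremal_matrix(8, (0, 1), 2)
# Four columns and four rows: the number of columns is the binding limit.
tall = extremal_matrix(4, (0, 1), 4)

print(nu_max(8, 2, 2), len(set(zip(*wide))))
print(nu_max(4, 2, 4), len(set(zip(*tall))))
```

In both cases the number of distinct columns in the constructed matrix equals the bound, which is the sense in which no smaller bound exists.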

□ 


We  can  now  develop  the  sufficient  condition  on  [Z]  such  that  decompositions 
always  exist. 

Theorem 5.4  $F : Z \times V_2 \times V_3 \to Y$ and $\phi : V_1 \times V_2 \to Z$ is a trivial decomposition
of $f : X_1 \times X_2 \times \cdots \times X_n \to Y$ with respect to partition $V$ if and only if $[Z] \ge \nu_{\max}$.

Proof: 

First we prove that $[Z] \ge \nu_{\max}$ implies that the decomposition is trivial. Assume
that $[Z] \ge \nu_{\max}$. By Theorem 5.3, $\nu \le \nu_{\max}$; with the assumption, this becomes
$[Z] \ge \nu_{\max} \ge \nu$. Finally, from the Basic Decomposition Condition, $\nu \le [Z]$, we see
that decomposition is always possible.

To prove the implication the other way: Assume that the decomposition is trivial;
we want to show that $[Z] \ge \nu_{\max}$. Suppose to the contrary that $[Z] < \nu_{\max}$. This $[Z]$
would be an upper bound on $\nu$, since this decomposition is possible for all functions.
But this bound is less than $\nu_{\max}$, which violates Theorem 5.3. Therefore, the
supposition is false and the theorem follows.

□ 


The relationship between $[V_1]$ and $\nu$ is shown graphically in Figure 5.4. We use
the fact that $[f] = [V_1][V_2][V_3]$ to plot $\nu$ as a function of $[V_1]$ with $[f]$ and $[V_2]$ as the
only parameters.

The  following  theorem  is  the  non-trivial  basic  decomposition  condition. 

Theorem 5.5  For finite integer $n$, and finite sets $X_1, X_2, \ldots, X_n$, and $Y$, let $f$ be a
partial function $f : X_1 \times X_2 \times \cdots \times X_n \to Y$. Let $V_1$, $V_2$, and $V_3$ be a partition of the
domain of $f$. There exists a non-trivial decomposition of $f$ consisting of the functions
$\phi : V_1 \times V_2 \to Z$ and $F : Z \times V_2 \times V_3 \to Y$ such that, whenever $f(v_1, v_2, v_3)$ is defined,
$f(v_1, v_2, v_3) = F(\phi(v_1, v_2), v_2, v_3)$ if and only if $\nu < \nu_{\max}$.

Proof: 

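From the Basic Decomposition Condition we have $\nu \le [Z]$, and from the Trivial Decomposition Theorem we have that a decomposition is non-trivial if and only if $[Z] < \nu_{\max}$.

□

[Figure 5.4: Relationship Between $\nu$ and $[V_1]$, where $[D(f)] = [f]$. Plot of $\nu$ against $[V_1]$ showing the curves $\nu = [V_1]$ and $\nu = [Y]^{[D(f)]/([V_1][V_2])}$, the level $\nu_{\max}$, the point $[D(f)]/[V_2]$ on the $[V_1]$ axis, and the area of possible $\nu$'s.]

[Figure 5.5: The Basic Decomposition. Diagram of $f(x)$ with variable groups of size $s$, $R - s$ and $n - R$.]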


We now develop one special case of the above theorem in some detail. Assume that $f : X^n \to X$ and we require that $Z = X^k$. The Basic Decomposition Condition becomes $\nu \le [X]^k$. Define parameters $s$ and $R$ such that $[V_1] = [X]^s$ and $[V_2] = [X]^{R-s}$. The parameters $n$, $s$, $R$ and $k$ are the number of variables in their respective groups (see Figure 5.5).

In this special case $\nu_{\max}$ becomes the smaller of $[X]^s$ and $[X]^{[X]^{(n-R)}}$, or equivalently, $\log_{[X]} \nu_{\max} = \min(s, [X]^{(n-R)})$. Therefore, the Basic Decomposition Condition is always satisfied if $s \le k$ or $[X]^{(n-R)} \le k$. The special case of this where $k = 1$ and $s = 0$ or $s = 1$ is the “trivial decomposition” of Curtis’62.


x1  x2  x3  |  f   g
 0   0   0  |  0   0
 0   0   1  |  0   0
 0   1   0  |  0   0
 0   1   1  |  0   1
 1   0   0  |  1   1
 1   0   1  |  0   0
 1   1   0  |  1   1
 1   1   1  |  0   1

Table 5.12: Functions f and g


5.2.5  Negative  Basic  Decompositions 

In decomposing a function we are often interested in minimizing the cost of a realization of the function based on that decomposition. We pursue two particular cost functions.

We use $r_f$ to denote the representation of $f$. $DFC(r_f)$ is the decomposed function cardinality of that representation. $DFC(f)$ is the $DFC(r_f)$ when $r_f$ is the optimal representation of $f$. Similarly, $L(r_f)$ is the program length of $r_f$ and $L(f)$ is the optimal program length of $f$.

Our first cost function, $L(r_f) = 1 + 3n + 2(n+1)[A] + n[P] + DFC(f) = 1 + 5n + 2(n+1)([V_1] + 2[V_2] + [V_3] + 1) + [\phi]\log[Z] + [F]\log[Y]$, reflects interconnection complexity and DFC. It is possible to completely define any representation $r_f$ with $L(r_f)$ bits.

The second cost function, $DFC(r_f) = [\phi]\log[Z] + [F]\log[Y]$, is a reflection of size complexity only. However, it is demonstrated in Chapter 4 that size complexity, as defined by DFC, is the principal component of overall complexity and that DFC is fundamentally tied to time complexity, circuit complexity and program length.

A  decomposition  is  “negative”  if  its  cost  is  less  than  the  cost  of  the  un-decomposed 
function.  L-Negative  implies  non-trivial. 

Theorem  5.6  Being  non-trivial  is  a  necessary  but  not  a  sufficient  condition  for  a 
basic  decomposition  to  be  negative. 

Proof: 

Non-triviality is a necessary condition since if a negative decomposition were trivial then every function would have a negative decomposition and the average minimum cost would violate the Average-Minimum Program Length Lower Bound of Appendix A. That non-triviality is not sufficient can be demonstrated with an example. Consider $f : \{0,1\}^3 \to \{0,1\}$ defined in Table 5.12.

The partition $V_1 = \{x_1, x_2\}$, $V_2 = \emptyset$, and $V_3 = \{x_3\}$ has $\nu = 2$ for $f$ and $\nu = 4$ for $g$. $f$ has a decomposition of the form $\phi : \{0,1\}^2 \to \{0,1\}$ and $F : \{0,1\}^2 \to \{0,1\}$, i.e. $Z = \{0,1\}$. This decomposition is non-trivial since there exists a function (namely $g$) that does not have this decomposition with respect to this partition. This decomposition is also not negative. That is, $L(r_f) > DFC(r_f) = 2^2 + 2^2 = 8$, which is not less than $2^3 = 8$.

□ 
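The counterexample can be checked mechanically. The sketch below is illustrative only: the helper name `column_multiplicity` and the 0-based variable positions are assumptions made here, and the bound $\nu_{\max} = \min([V_1], [Y]^{[V_3]})$ is our reading of Theorem 5.3 as suggested by the special case discussed earlier. It counts distinct columns of the partition matrix for $f$ and $g$ of Table 5.12.

```python
from itertools import product

# Truth tables from Table 5.12, keyed by (x1, x2, x3).
F_TABLE = {(0, 0, 0): 0, (0, 0, 1): 0, (0, 1, 0): 0, (0, 1, 1): 0,
           (1, 0, 0): 1, (1, 0, 1): 0, (1, 1, 0): 1, (1, 1, 1): 0}
G_TABLE = {(0, 0, 0): 0, (0, 0, 1): 0, (0, 1, 0): 0, (0, 1, 1): 1,
           (1, 0, 0): 1, (1, 0, 1): 0, (1, 1, 0): 1, (1, 1, 1): 1}


def column_multiplicity(table, n, col_vars):
    """Count distinct columns of the partition matrix.

    Columns are indexed by assignments to col_vars (0-based positions),
    rows by assignments to the remaining variables.
    """
    row_vars = [v for v in range(n) if v not in col_vars]
    columns = set()
    for c in product((0, 1), repeat=len(col_vars)):
        column = []
        for r in product((0, 1), repeat=len(row_vars)):
            x = [0] * n
            for var, bit in zip(col_vars, c):
                x[var] = bit
            for var, bit in zip(row_vars, r):
                x[var] = bit
            column.append(table[tuple(x)])
        columns.add(tuple(column))
    return len(columns)


# Partition V1 = {x1, x2}, V2 = {}, V3 = {x3}.
nu_f = column_multiplicity(F_TABLE, 3, [0, 1])
nu_g = column_multiplicity(G_TABLE, 3, [0, 1])

# Assumed reading of Theorem 5.3: nu_max = min([V1], [Y]^[V3]).
nu_max = min(2 ** 2, 2 ** (2 ** 1))

# DFC of the decomposition with Z = {0,1}: [phi]log[Z] + [F]log[Y].
dfc_decomposed = 2 ** 2 + 2 ** 2
dfc_original = 2 ** 3
```

Here `nu_f` is below `nu_max`, so the decomposition with $[Z] = 2$ is non-trivial, yet `dfc_decomposed` is not smaller than `dfc_original`, which is the point of the proof.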


Explicit  statements  of  the  various  negative  decomposition  conditions  are  of  interest 
since  they  are  used  in  decomposition  algorithms.  The  following  is  called  the  L- 
Negative  Basic  Decomposition  Condition. 

Theorem 5.7 A Basic Decomposition $\phi : V_1 \times V_2 \to Z$ and $F : Z \times V_2 \times V_3 \to Y$ of a function $f : X_1 \times X_2 \times \cdots \times X_n \to Y$ with respect to partition $V$ is negative with respect to $L$ if and only if $1 + 7n + 2(n+1)([V_1] + 2[V_2] + [V_3]) + [V_1][V_2]\log[Z] + [Z][V_2][V_3]\log[Y] < [V_1][V_2][V_3]\log[Y]$.

Proof: 

By definition of a negative decomposition: $L(r_f) < [X]\log[Y] + 1$. The theorem follows by substitution and simplification.

□ 


The  next  theorem  is  the  DFC-Negative  Basic  Decomposition  Condition. 

Theorem 5.8 A Basic Decomposition $\phi : V_1 \times V_2 \to Z$ and $F : Z \times V_2 \times V_3 \to Y$ of a function $f : X_1 \times X_2 \times \cdots \times X_n \to Y$ with respect to partition $V$ is negative with respect to DFC if and only if $[V_1]\log[Z] + [Z][V_3]\log[Y] < [V_1][V_3]\log[Y]$.

Proof: 

By definition of a negative decomposition: $DFC(r_f) < [X]\log[Y]$. The theorem follows by substitution and simplification.

□ 


The largest $[Z]$ that satisfies the above inequalities is the largest $[Z]$ that will yield a negative decomposition in the applicable situation. Since $[Z]$ must be greater than or equal to the column multiplicity $\nu$, the above inequalities give the maximum $\nu$ that will result in a negative decomposition. Therefore, when we are exclusively interested in negative decompositions, we are only interested in $\nu$'s which satisfy the above inequalities. That is, in the DFC case, $[V_1]\log\nu + \nu[V_3]\log[Y] < [V_1][V_3]\log[Y]$. If $\nu_{\max}$ is the largest integer that satisfies the negative decomposition condition then we are only interested in $\nu$'s which are less than or equal to $\nu_{\max}$. Therefore, when we are counting columns in a partition matrix, we can stop counting as soon as $\nu$ reaches $\nu_{\max}$. With respect to our special case of the previous section (i.e. $f : X^n \to X$, $Z = X^k$, $V_1 = X^s$, $V_2 = X^{(R-s)}$ and $V_3 = X^{(n-R)}$), the DFC-Negative decomposition condition becomes $k[X]^s + [X]^{(k+n-R)} < [X]^{(n-R+s)}$. That is, from the theorem above:

    $[V_1]\log[Z] + [Z][V_3]\log[Y] < [V_1][V_3]\log[Y]$,
    $[X]^s\log([X]^k) + [X]^k[X]^{(n-R)}\log[X] < [X]^s[X]^{(n-R)}\log[X]$,
    $k[X]^s\log[X] + [X]^{(k+n-R)}\log[X] < [X]^{(n-R+s)}\log[X]$,
    $k[X]^s + [X]^{(k+n-R)} < [X]^{(n-R+s)}$.

With the further specialization that there be no shared variables (i.e. $R = s$), the condition becomes $k[X]^s + [X]^{(k+n-s)} < [X]^n$, or $k[X]^{(s-n)} + [X]^{(k-s)} < 1$.

If we define $k_{\max}$ as the largest integer satisfying the applicable condition then we know that $k$ must be less than or equal to $k_{\max}$ for negative decompositions. It follows then that the maximum $\nu$ of interest is $\nu_{\max} = [X]^{k_{\max}}$.

The important point is that, given $n$ and the partition $V$, we can directly determine the maximum $\nu$ for a negative decomposition. This test can be useful since we may want to discontinue counting columns in evaluating column multiplicity after we are assured that no negative decomposition is possible.
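As a rough illustration of this test, the sketch below searches for the largest $k$ satisfying the specialized DFC-Negative condition, assuming it reads $k[X]^s + [X]^{(k+n-R)} < [X]^{(n-R+s)}$. The function names, the default $[X] = 2$, and the convention of returning 0 when no $k$ qualifies are assumptions made for this example, not part of the report.

```python
def dfc_negative(k, n, s, R, x=2):
    """Specialized DFC-Negative condition (assumed form):
    k*[X]^s + [X]^(k+n-R) < [X]^(n-R+s)."""
    return k * x ** s + x ** (k + n - R) < x ** (n - R + s)


def k_max(n, s, R, x=2):
    """Largest k >= 1 satisfying the condition, or 0 if none does.

    For k >= s the second term alone already reaches the right-hand
    side, so searching k = 1..s is sufficient.
    """
    best = 0
    for k in range(1, s + 1):
        if dfc_negative(k, n, s, R, x):
            best = k
    return best


def nu_max_negative(n, s, R, x=2):
    """Largest column multiplicity worth counting up to: [X]^k_max.

    Returns 0 when no negative decomposition is possible for this
    choice of n, s and R (a convention chosen for this sketch).
    """
    k = k_max(n, s, R, x)
    return x ** k if k > 0 else 0
```

For example, with six binary variables and four of them feeding $\phi$ with no sharing (`n=6, s=4, R=4`), the condition holds for $k = 1$ and $k = 2$ but not $k = 3$, so column counting could stop once $\nu$ exceeds 4.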

We have one final result concerning basic decompositions. Let $N$ be the number of ways that a set $X$ can be partitioned into three sets $V_1$, $V_2$ and $V_3$.^1 The number of combinations of $i$ elements from the set $X$, where $X$ has $n$ elements, is $n!/(i!(n-i)!) = C(n,i)$. This takes care of the first set $V_1$. Now select $j$ elements from $n-i$ elements: $(n-i)!/(j!(n-i-j)!) = C(n-i,j)$ where $j$ elements go into $V_2$ and $(n-i-j)$ elements go into $V_3$. Thus the number of combinations for one partition is $C(n,i)C(n-i,j)$.

For each $i$ there are different combinations of $n-i$ elements, so for all partitions, the number of different combinations is:

$$N = \sum_{i=0}^{n}\sum_{j=0}^{n-i} C(n,i)\,C(n-i,j) = \sum_{i=0}^{n} C(n,i)\sum_{j=0}^{n-i} C(n-i,j) = \sum_{i=0}^{n} C(n,i)\,2^{n-i}.$$

Recalling the Binomial Theorem [41, p.27]:

$$(a+b)^n = \sum_{i=0}^{n} C(n,i)\,a^i b^{n-i},$$

$$N = \sum_{i=0}^{n} C(n,i)\,1^i\,2^{n-i} = (1+2)^n = 3^n.$$

^1 This result was developed by Tina Normand.

The number of nontrivial partitions ($N'$) can be derived by the same method. We must make sure that $V_1$ has no less than two elements and no more than $n-2$ elements, so $i$ will go from two to $n-2$. Also $V_2$ can have zero elements, but no more than $n-i-2$ elements (to make sure $V_3$ has at least 2 elements), so $j$ will go from zero to $n-i-2$. The result is:

$$N' = \sum_{i=2}^{n-2}\sum_{j=0}^{n-i-2} C(n,i)\,C(n-i,j)$$
$$= \sum_{i=2}^{n-2} C(n,i)\sum_{j=0}^{n-i-2} C(n-i,j)$$
$$= \sum_{i=2}^{n-2} C(n,i)\left(2^{n-i} - C(n-i,n-i-1) - C(n-i,n-i)\right)$$
$$= \sum_{i=2}^{n-2} C(n,i)\left(2^{n-i} - (n-i) - 1\right)$$
$$= \sum_{i=2}^{n-2} C(n,i)\,2^{n-i} - (n+1)\sum_{i=2}^{n-2} C(n,i) + \sum_{i=2}^{n-2} i\,C(n,i).$$

This can be further reduced since $\sum_{i=0}^{n} i\,C(n,i) = n2^{(n-1)}$ [65, p.71].
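Both counts can be checked by brute force for small $n$. The following sketch (function names are ours) assigns every variable to one of the three sets, counts all assignments, and counts those meeting the non-triviality constraints stated above ($2 \le |V_1| \le n-2$ and $|V_3| \ge 2$). It then compares the second count with the closed form $\sum_{i=2}^{n-2} C(n,i)(2^{n-i} - (n-i) - 1)$.

```python
from itertools import product
from math import comb


def count_partitions(n):
    """Return (N, N') by enumerating all assignments of n variables
    to the sets V1, V2, V3 (encoded as 1, 2, 3)."""
    total = 0
    nontrivial = 0
    for assignment in product((1, 2, 3), repeat=n):
        total += 1
        size_v1 = assignment.count(1)
        size_v3 = assignment.count(3)
        if 2 <= size_v1 <= n - 2 and size_v3 >= 2:
            nontrivial += 1
    return total, nontrivial


def n_prime_formula(n):
    """Closed form for N' from the derivation above (i runs 2..n-2)."""
    return sum(comb(n, i) * (2 ** (n - i) - (n - i) - 1)
               for i in range(2, n - 1))
```

The enumeration grows as $3^n$, so it is only practical as a sanity check for a handful of variables.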

5.3  The  Ada  Function  Decomposition  Programs 

The PT 1 function decomposition algorithm takes a binary partial function of the form $f : \{0,1\}^n \to \{0,1\}$ and attempts to decompose the function into components $\phi$ and $F$ such that $f(x,y,z) = F(\phi(x,y),y,z)$ where $x$, $y$, and $z$ are vectors. We are especially interested in decompositions where the size of the decomposition (i.e. the size of $F$ plus the size of $\phi$) is less than the size of the original function. The algorithm accomplishes this decomposition by searching through all possible partitions of input variables and testing to see if the function decomposes for that partition. The test for decomposition and other theoretical aspects of function decomposition are developed in the preceding section. If the function does not decompose with respect to a given partition then another partition is tried. If the function does decompose then an attempt is made to decompose each of the component functions. As the algorithm searches through possible decompositions, any decomposition which has lower cost than any previous decomposition is recorded. When the search is completed, we have the lowest cost decomposition (lowest cost of those considered) of the input function.

Several versions of this function decomposition algorithm were implemented in Ada during the PT 1 project. This section describes these programs. A User’s Guide for the AFD program is in Appendix B. The PT 1 function decomposition software was written in Ada on the VMS System running on a Vax 11/780. This Vax is part of the Fire Control Simulation (FICSIM) Facility of WL/AART located in Building 22 at Wright-Patterson AFB, Ohio.


DECOMP_RECORD
    - A_TABLE ---> TABLE_RECORD
        - VARIABLES ---> LABELS array of
            - NATURALS
        - A_FUNCTION ---> TABLE array of
            - ENTRIES: (T,F,TBD)
    - A_PARTITION ---> PARTITION_RECORD
        - COLUMN_VARIABLES ---> LABELS.
        - ROW_VARIABLES ---> LABELS.
        - NEW_VARIABLES ---> LABELS.
        - A_PARTITION_MATRIX ---> PARTITION_MATRIX array of
            - COLUMNS.
        - FUNCTIONS ---> TABLE_RECORD.
        - UNIQUE_COLUMNS ---> TABLE.
        - MAX_NU: INDEX; INTEGER
        - NU: INDEX.
    - LEFT_DECOMP ---> DECOMP_RECORD.
    - RIGHT_DECOMP ---> DECOMP_RECORD.
    - CHAIN_DECOMP ---> DECOMP_RECORD.
    - DECOMP_COST: NATURAL

Figure 5.6: DECOMP_RECORD Data Structure

5.3.1  Program  Functional  Description 

DECOMP_RECORD is the principal data structure for the function decomposition algorithm. The structure of a DECOMP_RECORD is shown in Figure 5.6. Arrows in Figure 5.6 indicate a pointer type and the type being pointed to. Record and array components are preceded by a dash and are indented under the appropriate type. Names followed by a period have additional structure but that structure is shown elsewhere in the figure.

After inputting a function, the algorithm runs a FIND_LOWEST_COST routine and outputs the results. Figure 5.7 is a flow chart for the FIND_LOWEST_COST routine.

Figure 5.8 is a pseudo-code representation of FIND_LOWEST_COST and its principal component DECOMPOSE_CURRENT.

Figure 5.9 shows examples of the main data objects at various stages.


Find Lowest Cost Decomposition Algorithm

If the CURRENT Decomposition Record Pointer is null
then return
else
Begin
    Set BEST = CURRENT
    Repeat
        Generate next combination of variables
        if CURRENT decomposes then DECOMPOSE CURRENT
        FIND LOWEST COST of left, right, and chain functions
        Calculate cost of CURRENT
        if cost of CURRENT < cost of BEST then BEST := CURRENT
    Until all combinations of variables have been checked.
End

Figure 5.8: FIND_LOWEST_COST Pseudo-Code
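The recursion in Figure 5.8 can be illustrated with a much-simplified Python sketch. It is not the Ada implementation: it assumes total binary functions stored as dictionaries, uses only partitions without shared variables, measures cost as DFC, pursues a partition only when the components are guaranteed to be smaller than the original table, and returns only the lowest cost found rather than a decomposition record. The names `point` and `find_lowest_cost`, and the choice to fill unused column labels with 0, are assumptions of this example.

```python
from itertools import combinations, product


def point(n, cols, c, rows, r):
    """Build an n-bit input from column bits c and row bits r."""
    x = [0] * n
    for var, bit in zip(cols, c):
        x[var] = bit
    for var, bit in zip(rows, r):
        x[var] = bit
    return tuple(x)


def find_lowest_cost(f, n):
    """Lowest DFC found for total function f: {0,1}^n -> {0,1}."""
    best = 2 ** n                         # cost of the undecomposed table
    for size in range(2, n):              # variables fed to phi
        for cols in combinations(range(n), size):
            rows = [v for v in range(n) if v not in cols]
            row_points = list(product((0, 1), repeat=len(rows)))
            labels = {}                   # distinct column -> label
            phi = {}                      # column assignment -> label
            for c in product((0, 1), repeat=size):
                column = tuple(f[point(n, cols, c, rows, r)]
                               for r in row_points)
                phi[c] = labels.setdefault(column, len(labels))
            nu = len(labels)              # column multiplicity
            k = max(1, (nu - 1).bit_length())   # binary variables for Z
            if k * 2 ** size + 2 ** (k + len(rows)) >= 2 ** n:
                continue                  # cannot reduce cost: drop it
            columns = list(labels)        # insertion order = label order
            cost = 0
            for b in range(k):            # one phi component per Z bit
                phi_b = {c: (lab >> b) & 1 for c, lab in phi.items()}
                cost += find_lowest_cost(phi_b, size)
            big_f = {}                    # F over (Z bits, row variables)
            for z in product((0, 1), repeat=k):
                lab = sum(bit << b for b, bit in enumerate(z))
                for i, r in enumerate(row_points):
                    big_f[z + r] = columns[lab][i] if lab < nu else 0
            cost += find_lowest_cost(big_f, k + len(rows))
            best = min(best, cost)
    return best


parity4 = {x: sum(x) % 2 for x in product((0, 1), repeat=4)}
parity3 = {x: sum(x) % 2 for x in product((0, 1), repeat=3)}
```

On the four-variable parity function this sketch finds a cost of 12 against an undecomposed cost of 16. Because it is exhaustive over partitions, it is only usable on very small functions, which is consistent with the search limits discussed for the AFD versions.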

5.3.2  Program  Software  Description 

The main program for the AFD algorithm is the procedure PBML_DRIVER. Procedure PBML_DRIVER uses five packages of software. Package PBML_TYPES defines the data structures and has no body. Package PBML_FREE has 14 procedures for deallocating pointers. Various utility subprograms are in package PBML_UTIL, which contains four procedures and ten functions, and PBML_PACKAGE, which contains 11 procedures and three functions. The input and output routines are in PBML_IO. PBML_IO contains 11 procedures. The AFD software is contained in nine files. There are approximately 1,500 lines of code.

Figure 5.10 represents the compilation dependencies of the AFD software. The arrow means “is dependent upon.” For example, PBML_IO should be compiled after PBML_UTIL. After all files have been compiled, PBML_DRIVER should be ACS LINKED and RUN.

5.3.3  Versions  of  the  AFD  Algorithm 

Ten versions of the AFD program were implemented. We developed all these versions of the AFD algorithm in hopes of finding two algorithms. We had hoped to find a non-exhaustive optimal algorithm, that is, an algorithm that always finds the lowest possible cost and does so without considering all possible decompositions. We are doubtful that we succeeded in this. We had also hoped to find an algorithm on the knee-of-the-curve of run-time versus decomposition cost. We found that even our fastest versions were able to almost fully decompose most functions.


[Figure 5.9: Examples of the main data objects at various stages. The bit-matrix graphics are not recoverable; the legible annotations read as follows.]

    The table values are implicitly defined by labels and indices. They can be found by operating on the labels and indices rather than by moving binary columns around. Thus, the 0's and 1's shown here are never actually stored.

    Make_Partition_Matrix: Masked elements are turned into row indices of the partition matrix. Unmasked elements into column indices. Function values are copied into the partition matrix.

    Partition_Is_Decomposable: Count the unique columns. If the number is less than the number of columns, then it decomposes. Create enough new variables to have binary labels for each unique column.

    Map the columns of the partition matrix into the new variables. Put these in right link and as many chain links as necessary.

    Map the rows and new variables into the original function. Put in left link.


Organization of PBML Code

    procedure PBML_DRIVER
            |
            V
    package PBML_PACKAGE
            |
            V
    package PBML_IO
            |
            V
    package PBML_UTIL
            |
            V
    package PBML_FREE
            |
            V
    package PBML_TYPES

Figure 5.10: Compilation Dependencies

Search Constraints Common to All Versions

An exhaustive search approach to function decomposition becomes intractable for functions on more than two or three variables. Therefore, it is necessary to limit the search. There are some search limits that all the versions have in common. For a given partition of variables, the basic approach of each version is to compute the column multiplicity $\nu$ of the function with respect to that partition. The value of $\nu$ is then compared to a threshold that is version dependent. If the threshold is exceeded then this decomposition is dropped from further consideration. The idea is that we only want to pursue decompositions that are reducing cost. It is in how much of a reduction we want or how we measure cost that the versions differ. For our highest threshold (i.e. most exhaustive search) we know of no functions where a more exhaustive search would produce a lower overall cost decomposition. However, for our other thresholds we know that some desirable decompositions are being dropped. Using this threshold significantly reduces the search space, since otherwise every partition yields a decomposition whose children must be decomposed before we know that it will not result in a lower overall cost.

When the threshold is not exceeded and the decomposition is pursued, the first step is to form the children of the decomposition, $F$ and $\phi$. All the versions form $F$ and $\phi$ by using the binary equivalent of an enumeration of the columns generated in counting up the column multiplicity. This simple approach to defining $F$ and $\phi$ substantially reduces the overall search space. There are typically many different $F$'s and $\phi$'s for a given decomposition. At least in some cases, how $F$ and $\phi$ are defined affects the decomposability of $F$ and $\phi$ and, consequently, the eventual cost of the overall decomposition. Therefore, to be exhaustive, we must assess the decomposability of all the different possible values for $F$ and $\phi$ before arriving at a specific $F$ and $\phi$. None of the AFD versions did this.

Search  Constraint  Differences  Between  Versions 

There are two classes of variable partitions used by the AFD algorithms. Version 1 and all the 2 versions (i.e. 2, 2a, 2b, 2ab) partition the variables into two disjoint sets. One set is input into $F$ and the other into $\phi$. Version 3 and all the 4 versions (i.e. 4, 4a, 4b, 4ab) partition the variables into three disjoint sets. One set is input into $F$ only, the second set is input into $\phi$ only, and the third set is input to both $F$ and $\phi$. We call this latter class of partitions the “shared variable” class while the first class is called “no-shared variables.” When only no-shared variable partitions are used in a search it is possible for the specific values assigned to $F$ and $\phi$ to affect the cost of the overall decomposition. We know of no cases where the method of assigning values to $F$ and $\phi$ affects the overall decomposition when shared variable partitions are used in the search. It is possible that one has an alternative between searching through values for $F$ and $\phi$ and searching with shared variable partitions, with either approach resulting in optimal decompositions.

The “a” versions (i.e. versions 2a, 2ab, 4a and 4ab) differ from the other versions in that they are “greedy.” The “a” versions search through partitions in order of increasing numbers of variables input into $\phi$. When a cost saving decomposition is found for some number of input variables into $\phi$ then partitions with a larger number of variables input into $\phi$ are not considered. That is, once a decomposition is found for some number of variables into $\phi$ we pursue that decomposition but do not back up to consider larger numbers of inputs into $\phi$. The idea here is that we want to break the original function into pieces that are as small as possible. Therefore, when we have succeeded in breaking out a small piece, we do not worry about trying to break out a larger piece. The “a” versions run substantially faster than the other versions and perform only slightly worse in terms of the cost of the decompositions produced.

As discussed previously, the AFD algorithms compare the cost of a candidate decomposition to a threshold. There are two methods for computing the cost of a candidate decomposition. The “b” versions (i.e. versions 2b, 2ab, 4b, 4ab) compute cost based on the cardinality of the components (DPFC). The other versions compute cost based on the number of variables input into each component of the decomposition (DFC). That is,

$$DPFC = \sum_{p_i \in P} [p_i]$$

and

$$DFC = \sum_{p_i \in P} 2^{n_i},$$

where $P$ is the set of components, $[p_i]$ is the cardinality of component $p_i$ and $n_i$ is the number of variables input into $p_i$.

When  the  components  are  total  functions  the  two  costs  are  the  same.  However, 
when  a  component  is  a  partial  function,  which  can  occur  even  when  the  input  is  a 
total  function,  DPFC  is  less  than  DFC.  Since  candidate  decompositions  are  pursued 
whenever  the  cost  is  less  than  the  threshold,  “b”  versions  will  in  general  conduct  a 
larger  search. 
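The difference between the two measures can be illustrated with a small sketch. The representation is an assumption made for this example: each component is a dictionary over its full binary input space, with `None` marking a don't-care point, and the cardinality of a component is taken to be its number of defined points.

```python
from itertools import product


def dfc(components):
    """Sum of 2^(number of inputs) over all components."""
    total = 0
    for table in components:
        n_inputs = len(next(iter(table)))   # length of any input tuple
        total += 2 ** n_inputs
    return total


def dpfc(components):
    """Sum over components of the number of defined (care) points."""
    return sum(sum(1 for value in table.values() if value is not None)
               for table in components)


# phi is a total function on two variables (XOR).
phi = {x: x[0] ^ x[1] for x in product((0, 1), repeat=2)}

# F is a partial function on three variables with two don't-cares.
big_f = {x: x[0] & x[2] for x in product((0, 1), repeat=3)}
big_f[(1, 1, 0)] = None
big_f[(1, 1, 1)] = None

total_cost_dfc = dfc([phi, big_f])
total_cost_dpfc = dpfc([phi, big_f])
```

With every component total the two functions return the same value; once a don't-care appears, as in `big_f` here, the DPFC figure drops below the DFC figure.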




Version 1  : NU_MAX = min(NU_FEATURE, NU_LUB).
Version 2  : NU_MAX = min(NU_NEG_ALL_TOTAL, NU_LUB).
Version 2a : NU_MAX = min(NU_NEG_ALL_TOTAL, NU_LUB).
Version 2b : NU_MAX = min(NU_NEG, NU_LUB).
Version 2ab: NU_MAX = min(NU_NEG, NU_LUB).
Version 3  : NU_MAX = min(NU_FEATURE, NU_LUB).
Version 4  : NU_MAX = min(NU_NEG_ALL_TOTAL, NU_LUB).
Version 4a : NU_MAX = min(NU_NEG_ALL_TOTAL, NU_LUB).
Version 4b : NU_MAX = min(NU_NEG, NU_LUB).
Version 4ab: NU_MAX = min(NU_NEG, NU_LUB).

Figure 5.11: NU_MAX for Each Version of the AFD Algorithm

Finally, the versions differ in the threshold they use to evaluate candidate decompositions. In essence, versions 1 and 3 set the threshold such that decompositions are pursued if they are “featured.” A decomposition is featured if $\phi$ has fewer output variables than input variables (c.f. [48, pp.64-66]). The “2” and “4” versions (i.e. versions 2, 2a, 2b, 2ab, 4, 4a, 4b, and 4ab) set the threshold such that decompositions are pursued if they result in a cost reduction (what we call a negative decomposition). In general there are many more featured decompositions than negative decompositions of a given function. Therefore, the feature-based versions are much more exhaustive than the other versions.

NU_MAX is defined in Figure 5.11. The various NU's are defined as follows, where $s$ is the number of inputs to $\phi$ only and $R$ is the total number of variables input to $\phi$ (including shared variables).

• NU_LUB is the least upper bound on $\nu$ for a given partition of variables. NU_LUB is $\min(2^s, 2^{2^{(n-R)}})$.

• NU_NEG is the $\nu$ such that if $\nu >$ NU_NEG then no negative decomposition exists for the partition being considered. NU_NEG is the largest $\nu$ such that $[(\log(\nu))2^R + \nu 2^{(n-s)}] < 2^n$.

• NU_NEG_INPUT_TOTAL is the $\nu$ such that if the input function is total then $\nu \le$ NU_NEG_INPUT_TOTAL implies that a negative decomposition necessarily exists. This was used in the early b versions with unintended results. NU_NEG_INPUT_TOTAL is the largest $\nu$ such that $[(\log(\nu))2^R + \nu 2^{(n-s)}] < [f]$.

• NU_NEG_ALL_TOTAL is the $\nu$ such that if all functions involved are total then $\nu \le$ NU_NEG_ALL_TOTAL implies that a negative decomposition necessarily exists. NU_NEG_ALL_TOTAL is the largest $\nu$ such that $k2^{(R-n)} + 2^{(k-s)} < 1$, where $k$ is the number of binary variables needed to label $\nu$ columns.

• NU_FEATURE is the $\nu$ such that $\nu \le$ NU_FEATURE implies that a featured decomposition necessarily exists. NU_FEATURE is $2^{(s-1)}$.
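The thresholds above can be sketched in a few lines. This is an illustration under stated assumptions rather than the AFD code: logarithms are taken base 2, NU_NEG_ALL_TOTAL is returned as $2^k$ for the largest qualifying $k$ (with the inequality multiplied through by $2^n$ to stay in integers), a return value of 0 means no $\nu$ qualifies, and the version-to-threshold mapping follows Figure 5.11. All function names are ours.

```python
from math import log2


def nu_lub(n, s, R):
    """Least upper bound on nu: min(2^s, 2^(2^(n-R)))."""
    return min(2 ** s, 2 ** (2 ** (n - R)))


def nu_neg(n, s, R):
    """Largest nu with log2(nu)*2^R + nu*2^(n-s) < 2^n (0 if none)."""
    best = 0
    for nu in range(1, 2 ** s + 1):
        if log2(nu) * 2 ** R + nu * 2 ** (n - s) < 2 ** n:
            best = nu
    return best


def nu_neg_all_total(n, s, R):
    """Largest 2^k with k*2^(R-n) + 2^(k-s) < 1 (0 if none).

    The test is written as k*2^R + 2^(k+n-s) < 2^n, which is the
    same inequality multiplied by 2^n.
    """
    best = 0
    for k in range(1, s + 1):
        if k * 2 ** R + 2 ** (k + n - s) < 2 ** n:
            best = 2 ** k
    return best


def nu_feature(s):
    """Featured threshold: 2^(s-1)."""
    return 2 ** (s - 1)


def nu_max(version, n, s, R):
    """NU_MAX per Figure 5.11 for a version name such as '2a' or '4b'."""
    if version in ("1", "3"):
        limit = nu_feature(s)
    elif "b" in version:                 # 2b, 2ab, 4b, 4ab
        limit = nu_neg(n, s, R)
    else:                                # 2, 2a, 4, 4a
        limit = nu_neg_all_total(n, s, R)
    return min(limit, nu_lub(n, s, R))
```

For six variables with four inputs to $\phi$ and no sharing (`n=6, s=4, R=4`), the feature-based threshold is the loosest, the DFC-based one the tightest, and the DPFC-based one lies between them, which matches the ordering of search sizes described in the text.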

NU_MAX is used to halt the counting of columns in determining column multiplicity and, when $\nu$ does reach NU_MAX, we know that we do not want to pursue this partition of variables any further. However, in general, just because $\nu$ is less than NU_MAX does not ensure that this partition will be a desired decomposition. Therefore, it is necessary to go ahead and form the decomposition ($F$ and $\phi$, but do not try to decompose $F$ or $\phi$) and determine its cost before deciding whether or not to include it in the current decomposition tree. We think it turns out that, except for the b versions, $\nu$ less than NU_MAX does ensure a desired decomposition.

The relationship between the search spaces of the different versions is $V2a \subset V2 \subset V2b \subset V1 \subset V3$, $V4a \subset V4 \subset V4b \subset V3$ and $V2i \subset V4i$ for $i$ = a, b, ab or blank. Versions 2a and 4a were used for most of the experimental work described in Chapter 6. Unlike all the other versions, we found no functions for which version 3 did not find the best known representation. Therefore, version 3 cannot be ruled out as a possible optimal algorithm.

In  summary,  there  are  essentially  three  dimensions  in  the  AFD  version  “space.” 

•  1.  Greedy  and  2.  Not  Greedy. 

•  1.  Not  Shared  and  2.  Shared. 

•  1.  Negative  DFC,  2.  Negative  DPFC  and  3.  Features  required.

Thus, a point in this space (e.g. (1,2,1)) corresponds to a version of the AFD algorithm; in particular:

Version  1  is  (2,1,3). 

Version  2  is  (2,1,1). 

Version  2a  is  (1,1,1). 

Version  2b  is  (2,1,2). 

Version  2ab  is  (1,1,2). 

Version  3  is  (2,2,3). 

Version  4  is  (2,2,1).

Version  4a  is  (1,2,1). 

Version  4b  is  (2,2,2). 

Version  4ab  is  (1,2,2). 

If version $i$ has coordinates $(a_i, b_i, c_i)$, version $j$ has coordinates $(a_j, b_j, c_j)$ and $i \ne j$ then the search space of version $j$ is a proper subset of the space of version $i$ whenever $a_i \ge a_j$, $b_i \ge b_j$ and $c_i \ge c_j$.

For example, version 4b = (2,2,2) does a larger search than version 4 = (2,2,1). These relationships are summarized in Table 5.13. Note that we cannot draw conclusions from this about the relative size of the searches of some versions, e.g. version 2ab versus 4a. Note also that we did not implement versions corresponding to (1,1,3) or (1,2,3) because the feature-based versions were so slow.
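The coordinate rule can be written down directly. In the sketch below the coordinates follow the version list, with version 4 taken as (2,2,1) as in the example in the text and the third coordinate ordered as in Table 5.13 (1 = negative DFC, 2 = negative DPFC, 3 = featured). The names `VERSIONS` and `searches_more` are ours, and the comparison is assumed to be componentwise greater-or-equal.

```python
# (greedy, shared, threshold) coordinates of each implemented version.
VERSIONS = {
    "1": (2, 1, 3), "2": (2, 1, 1), "2a": (1, 1, 1),
    "2b": (2, 1, 2), "2ab": (1, 1, 2),
    "3": (2, 2, 3), "4": (2, 2, 1), "4a": (1, 2, 1),
    "4b": (2, 2, 2), "4ab": (1, 2, 2),
}


def searches_more(i, j):
    """True if version j's search space is a proper subset of
    version i's under the coordinate rule."""
    ci, cj = VERSIONS[i], VERSIONS[j]
    return ci != cj and all(x >= y for x, y in zip(ci, cj))


# Pairs the rule cannot order, such as 2ab versus 4a.
incomparable = sorted(
    (i, j) for i in VERSIONS for j in VERSIONS
    if i < j and not searches_more(i, j) and not searches_more(j, i)
)
```

Under this rule version 3 dominates every other version, and pairs such as 2ab and 4a are left unordered, in line with the remark above.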


1. Not Shared:    1. Neg DFC    2. Neg DPFC    3. Featured
1. Greedy         2a            2ab            -
2. Not Greedy     2             2b             1

2. Shared:        1. Neg DFC    2. Neg DPFC    3. Featured
1. Greedy         4a            4ab            -
2. Not Greedy     4             4b             3

Table 5.13: AFD Algorithm Version Space

5.4  Ada  Function  Decomposition  Program  Performance

We are interested in how well the AFD algorithms decompose functions and in how long the decomposition takes. This section reports on the results of several experiments to assess the algorithms’ performance.

We  used  the  various  versions  of  the  Ada  Function  Decomposition  program  on 
VAX  and  MICROVAX  computers  to  decompose  well  over  1000  different  functions, 
ranging  in  size  from  4  variables  to  10  variables,  ranging  in  cost  complexity  from  0 
percent  (most  patterned)  to  100  percent  (completely  unpatterned  or  ‘random’)  and 
ranging  in  number  of  cares  from  5  to  100  percent.  Since  many  different  experiments 
had  been  performed,  there  was  adequate  data  to  draw  some  conclusions  about  the 
relative  performance  of  the  different  versions  of  the  algorithm  in  terms  of  both  cost 
reduction  and  run-time.  Two  subsets  of  data  were  extracted  from  the  PT  1  data  base 
and  some  statistical  analysis  was  performed  on  them. 

The first subset, ‘Set A,’ was composed of the output for all functions that had
been decomposed by all ten of the versions of the program. There are 64 functions in
this  category.  The  second  subset,  ‘Set  B,’  was  composed  of  the  output  for  all  functions 
that  had  been  decomposed  by  every  version  of  the  program  with  the  exception  of 
version  3.  Using  version  3  to  decompose  functions  on  six  or  more  variables  generated 
run-times  that  were  far  too  great.  Therefore,  because  Set  A  excluded  all  functions 
that  were  not  decomposed  using  version  3,  it  necessarily  excluded  all  functions  on 
more  than  five  variables  (with  the  exception  of  two  very  simple  functions).  Set  B  was 
formed  so  that  we  could  compare  the  other  nine  versions  on  functions  of  larger  size. 
Set  B  contained  119  functions.  Set  A  is  a  subset  of  Set  B. 

The first 14 functions in Set A were from a set of ‘trick functions’ that was constructed to test certain aspects of the various algorithms. This included the ‘checkerboard function’ on five, six, seven and eight variables; four different functions whose optimal decompositions included shared variables; and several functions that could not be decomposed. The last 50 functions of Set A were all randomly generated functions on five variables.

Set  B  included  all  the  functions  in  Set  A.  In  addition,  it  included  images  of 
the  letters  R  and  A,  and  several  test  functions  on  six  and  seven  variables  that  were 
designed  either  to  decompose  in  certain  irregular  ways  or  to  be  non-decomposable.  It 
also  included  a  group  of  50  random  six- variable  functions. 

To  summarize,  the  functions  in  Sets  A  and  B  ranged  from  the  highly  patterned 
checkerboard  functions  to  the  very  unpatterned  ‘nodecomp’  functions,  but  the  great 
majority  of  them  were  unpatterned  randomly  generated  functions;  they  ranged  in 
size  from  four  variables  to  eight  variables  but  the  majority  of  them  were  either  five 
or  six  variable  functions;  and  all  of  them  without  exception  were  total  (i.e.  none  of 
them  had  any  ‘don’t  care’  conditions).  In  addition  to  these  sets  where  comparisons 
were  made  on  average,  there  are  some  individual  functions  whose  decomposition  gives 
some  insight  into  the  algorithms’  performance. 

The  following  two  sections  consider  the  relative  performance  of  the  various  versions 
in  the  individual  areas  of  cost  reduction  and  run-time. 

5.4.1  Cost  Reduction  Performance 

Set  A  and  Set  B  Comparisons 

Cost  reduction  is  the  primary  aim  of  function  decomposition.  The  driving  purpose 
behind  each  of  the  versions  of  the  program  is  to  find  a  low  cost  representation  of 
any  given  function,  if  one  exists,  and,  hopefully,  to  find  the  lowest  or  “optimal” 
representation  of  the  function.  While  we  cannot  guarantee  that  any  decomposition 
found  is  truly  optimal  (except  in  a  few  cases  that  are  amenable  to  theoretical  analysis) 
the  data  that  we  collected  have  shown  that  all  the  versions  of  the  program  are  able 
to  find  relatively  low  cost  representations  for  most  functions  that  are  decomposable. 
For  functions  that  are  highly  patterned  and  have  theoretical  optimal  representations, 
all  of  the  versions  are  able  to  find  the  optimal  representations.  For  functions  that  are 
more  complex,  the  different  versions  vary  somewhat  in  performance  with  the  more 
complete  searches  generally  finding  lower  cost  representations.  The  results  of  the  Set 
A  comparison  bring  out  this  difference  primarily  with  respect  to  version  3  on  five 
variables  or  less.  The  results  on  Set  B  show  this  difference  on  the  other  versions. 

Version  3  may  find  optimal  representations.  No  other  version  was  ever  able  to 
find  any  representation  of  a  function  with  a  lower  cost  than  was  found  by  version  3, 
(although  they  were  often  able  to  find  a  representation  with  the  same  cost),  nor  were 
we  able  to  decompose  any  functions  by  hand  to  a  lower  cost. 

The  algorithms  that  do  not  directly  consider  shared  variables  occasionally  found 
a  shared-variable  decomposition  through  the  creation  of  a  new  variable  and  a  table 
of  zero  cost. 

The versions’ cost reduction performance is shown in Table 5.14. Probably the
most important thing to note from this is that, in relative terms, the difference in cost
reduction performance from the least exhaustive algorithm (version 2a) to the most
exhaustive algorithm (version 3) is quite small. Other data suggested that version 2a
was particularly likely to do nearly as well as version 3 when the function being
decomposed was very patterned. Since most of the functions that were decomposed during
the remainder of the phenomenology study fell into the highly patterned category,
this result was one of the things that influenced us to use the ‘greedy’ algorithms.

Version    Set A DFC    Set B DFC
3          15.93        NA
1          16.19        38.34
4          16.38        38.35
4a         16.38        38.35
4b         16.38        38.39
4ab        16.69        38.52
2b         17.00        38.69
2          17.06        38.72
2a         17.06        38.72
2ab        17.12        38.69

Table 5.14: Average DFC for Set A and Set B

We wanted not only to compare the decomposition performance between the versions
but also to know how close the algorithms came to the best possible
decomposition. Some of the functions that we ran have well known
representations (e.g. addition, parity, palindromes, functions with only one minority
element). All AFD versions found the expected decompositions for these functions.

K-Clique  Function  Example 

There  were  some  cases  where  version  2a  performed  poorly.  However,  these  were  runs 
involving  large  functions  (n  =  10)  where  the  algorithm  was  not  allowed  to  run  to 
completion. For example, the 3-clique function on a 5-node graph is a function with
10  variables.  Version  2a  was  allowed  to  run  on  this  function  for  about  170,000  seconds. 
The  best  decomposition  found  to  that  point  had  a  DFC  of  360  or  35.2  percent.  Version 
4a  found  a  164  DFC  or  16.0  percent  decomposition  at  65,000  seconds.  We  know  this 
function  has  a  116  DFC  or  14.5  percent  decomposition  using  Savage’s  sum-of-products 
form.  Therefore,  when  the  algorithms  are  not  allowed  to  run  to  completion,  version 
4a  can  substantially  outperform  version  2a  in  a  given  amount  of  time. 
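
To make the example concrete, the following Python sketch builds the truth table of a k-clique function and converts a DFC into a percentage of the function's cardinality. It is illustrative only and is not taken from the AFD program: the names `clique_truth_table` and `dfc_percent`, the edge ordering, and the reading of "percent" as DFC divided by 2^n are assumptions.

```python
from itertools import combinations


def clique_truth_table(nodes=5, k=3):
    """Truth table of the k-clique function on a graph with `nodes` nodes.

    Each input variable marks the presence of one edge, so there are
    C(nodes, 2) variables (10 for a 5-node graph). The edge ordering is an
    arbitrary choice made for this sketch.
    """
    edges = list(combinations(range(nodes), 2))
    edge_index = {edge: i for i, edge in enumerate(edges)}
    # For every group of k nodes, list the edge variables that must all be 1.
    cliques = [
        [edge_index[edge] for edge in combinations(group, 2)]
        for group in combinations(range(nodes), k)
    ]
    table = []
    for x in range(2 ** len(edges)):
        # Bit i of x says whether edge i is present in the graph.
        has_clique = any(all((x >> i) & 1 for i in c) for c in cliques)
        table.append(int(has_clique))
    return table


def dfc_percent(dfc, n_vars):
    """DFC as a percentage of the cardinality 2**n_vars (assumed definition)."""
    return 100.0 * dfc / 2 ** n_vars


table = clique_truth_table()
print(len(table))                       # number of input combinations
print(round(dfc_percent(360, 10), 1))   # version 2a result quoted above
print(round(dfc_percent(164, 10), 1))   # version 4a result quoted above
```

With ten edge variables the table has 1024 entries, which is the cardinality against which the quoted percentages for versions 2a and 4a are computed under this assumption.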

Decomposition  of  Neural  Net  Like  Functions 

The AFD program tries to decompose functions by breaking out one piece at a time.
We wonder whether there are some decompositions that cannot be found
this way. Neural Nets have an architecture that has a low Decomposed Function
Cardinality (DFC) but are not in a form that AFD would ever generate. Neural Nets
have a gross architecture like Figure 5.12. Each of the boxes in the first layer has
all n variables as input. Neural Nets have low computational complexity because the
function in each box is very patterned. In real Neural Nets the function is a sum of
products and perhaps a threshold. For comparison purposes we can let these boxes
have a function of minimum cost (that is, minimum among functions without vacuous
variables). Each box in a Neural Net has an architecture like Figure 5.13. Therefore,
each box has a DFC of 4(n - 1), where n is the number of input variables. Consider
three specific Neural Net architectures. These architectures are identified as a-b,
where a is the number of variables and b is the number of nodes in the first layer.
Figure 5.14 shows the 6-2 architecture. The DFC’s of these architectures are equal to
b(4(a - 1)) + 4(b - 1); see Table 5.15. Most other architectures on a small number of
variables have a DFC which exceeds [f]. Although we know that the AFD program
will not find the same architecture as the Neural Nets, we hope that it will find some
other architecture with at most the same DFC. If it does not then we know that the
AFD approach fails to recognize this class of patterns. To test this we generated 10
functions with each of the NN architectures. The function for each box was selected
randomly from the 2-variable functions with DFC of 4 (i.e. constant and projection
functions were excluded). These functions were then run on AFD version 2a. The
results are shown in Table 5.16. In summary, the DFC found by AFD was always less
than that of the original NN architecture.

Figure 5.12: Neural Net Gross Architecture

Architecture    [f]    NN-DFC
6-2             64     44
8-2             256    60
8-4             256    124

Table 5.15: DFC of NN Like Architectures

Table 5.16: AFD-DFC of NN Like Architectures
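
The NN-DFC entries of Table 5.15 follow directly from the formula b(4(a - 1)) + 4(b - 1). The short Python sketch below recomputes them; the function name `nn_dfc` is ours and is not part of the AFD program.

```python
def nn_dfc(a, b):
    """DFC of an a-b Neural Net like architecture.

    a: number of input variables, b: number of boxes in the first layer.
    Each first-layer box costs 4(a - 1); the output box combines the b
    first-layer outputs and costs 4(b - 1).
    """
    first_layer = b * 4 * (a - 1)
    output_box = 4 * (b - 1)
    return first_layer + output_box


for a, b in [(6, 2), (8, 2), (8, 4)]:
    # [f] is the cardinality of a function on a binary variables.
    print(f"{a}-{b}: [f] = {2 ** a}, NN-DFC = {nn_dfc(a, b)}")
```

In all three cases the NN-DFC is below the cardinality [f], which is why these architectures count as patterned.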

5.4.2  Run-Time  Performance 

Running  on  a  Vax  11/780  with  a  throughput  of  roughly  1  million-instructions-per- 
second  (MIPS),  the  different  versions  exhibited  run-times  ranging  from  less  than  a 
second  up  to  more  than  100,000  seconds.  In  order  to  design  a  reasonable  number 
of  experiments  of  a  reasonable  size,  we  needed  an  ability  to  estimate  the  run-time 
as  a  function  of  the  number  of  variables,  number  of  cares  and  number  of  minority 
elements,  and  version  of  the  AFD  program.  This  ability  to  estimate  run-time  allowed 
us  to  experiment  with  the  best  version  of  the  AFD  algorithm  that  time  would  allow. 

The  first  step  in  our  run-time  analysis  was  to  make  comparisons  of  all  algorithms 
on  the  Set  A  and  Set  B  functions.  Table  5.17  shows  the  average  run-time  comparisons 
for  Set  A  and  Set  B.  Run-time  is  measured  in  seconds  of  CPU  time  on  our  Vax  11/780. 
Although  the  cost  reduction  performances  of  all  the  versions  were  roughly  equivalent, 
their  run-times  varied  greatly. 

Some  versions  always  had  lower  run-times  and  decomposed  functions  to  lower  costs 
than  certain  other  versions.  For  instance,  version  2a  runs  faster  than  version  2ab  and 
finds  an  equal  or  lower  cost  decomposition  for  every  function  in  Set  B.  After  getting 
rid  of  the  versions  that  did  not  show  any  increase  in  cost  reduction  performance  to 
justify their increase in run-time, we were left with the following versions, ranked
from  longest  to  shortest  run-times:  version  3,  version  1,  version  4a,  version  2b  and 
version  2a. 

Version    Set A (s)    Set B (s)
2a         0.9          2.4
2          1.0          2.5
2ab        1.5          4.0
2b         2.0          4.9
4a         7.0          13.0
4ab        10.4         13.7
4          10.6         13.7
4b         14.0         16.0
1          70.0         275.0
3          220.0        NA

Table 5.17: Average Run-time for Set A and Set B

We searched the PT 1 data base and pulled out the functions on a given number of
variables that were submitted to one of the five ‘good’ versions and recorded how many
there were and their maximum, minimum and average run-times. There were two
distinctions that we made in this process. First of all, we did not include run-time data
on functions with vacuous variables. The reason for this is that once the AFD program
has eliminated one or more vacuous variables, it is then decomposing a function of
one or more fewer variables. Secondly, we made a distinction between the run-times
on the functions that did decompose and the ones that did not. The run-times of
the functions that did not decompose all tended to be very closely grouped, while the
functions that did decompose generated run-times that varied widely. Tables 5.18
and 5.19 show these results. The four entries for each version - number of variables
combination are (from top to bottom): number of runs, minimum run-time, average
run-time, maximum run-time. Note the expected exponential trend in increasing
run-times for any given version on functions that do not decompose.

Several  experiments  were  performed  to  assess  the  relationship  between  run-time, 
DFC  and  number  of  minority  elements.  The  number  of  minority  elements  is  the 
number  of  elements  of  a  function  that  have  output  1  or  the  number  of  elements  that 
have output 0, whichever is smaller. These experiments were all carried out on
version  2a. 
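
The definition of minority elements can be stated in a few lines of Python. This is a sketch only; it assumes a total function (no don't cares) stored as a list of 0/1 outputs, which is our choice of representation rather than the one used in the Ada program.

```python
def minority_elements(truth_table):
    """Number of minority elements of a total binary function.

    truth_table: list of 0/1 outputs, one per input combination.
    Returns the count of 1 outputs or of 0 outputs, whichever is smaller.
    """
    ones = sum(truth_table)
    zeros = len(truth_table) - ones
    return min(ones, zeros)


# Two-variable AND has a single 1 output, so one minority element.
print(minority_elements([0, 0, 0, 1]))
# Two-variable XOR is balanced, so it has two minority elements.
print(minority_elements([0, 1, 1, 0]))
```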

Figure  5.15  shows  the  relationship  between  run-time  and  DFC  for  eight  variables. 
We  found  no  consistent  pattern  other  than  a  general  tendency  for  functions  that 
decompose  to  have  larger  average  and  much  larger  maximum  run-times. 

It was found that one of the largest factors influencing run-time was the number of
minority elements in a function, particularly when the number of minority elements
was very small; therefore, functions were generated on four, five, six, seven, eight and
nine variables which each had a fixed number of minority elements but were otherwise
generated with no intended pattern. The results are shown in Figures 5.16 through
5.18. Functions with a proportionally small number of minority elements always
decompose somewhat, and most of those that decompose have widely varying
run-times. Once the number of minority elements increases past a critical point (around
15 percent of the cardinality of the function) then a randomly generated function will
usually not decompose (or decompose very little) and the run-time associated with it
will be very close to the expected run-time for any unpatterned function.

Table 5.18: Run-Times for Functions That Did Not Decompose

Table 5.19: Run-Times for Functions That Did Decompose

Figure 5.15: Run-Time versus DFC for Functions on Eight Variables

Figure 5.16: Run-Time versus Number of Minority Elements for Six Variable Functions
(average, minimum and maximum run-time)

Figure 5.17: Run-Time versus Number of Minority Elements for Seven Variable Functions
(average, minimum and maximum run-time)

Figure 5.18: Run-Time versus Number of Minority Elements for Eight Variable Functions
(average, minimum and maximum run-time)

5.4.3  Summary 

There was little difference in the average DFC performance between versions, although
there are individual cases with substantial differences. There was a large difference in
the average run-time between versions. We narrowed the number of versions to the five
best, a set of versions where an increase in average run-time always corresponded to
an increase in cost reduction performance. The experimental run-time data was
consistent with the expected exponential relationship between run-time and the number
of variables. There is a great deal of run-time sensitivity to the number of minority
elements in a function. Even the faster versions had sufficient cost reduction
performance to allow for some interesting results in the Pattern Phenomenology experiments
(Chapter 6).


5.5  Summary 

This chapter introduced the problem of function decomposition and discussed the Ada
implementation that was used in this project. Approaches to decomposition consist
of a test for decomposability and a search methodology. The test for decomposability
is understood theoretically and was described in detail. The approaches to searching
have little theoretical basis. We described several approaches that were experimented
with in this project. We found that a 1 MIPS machine is capable of decomposing
most functions on fewer than 10 binary variables in a matter of hours.


Chapter 6

Pattern Phenomenology


... when you can measure what you are speaking about, and express it
in numbers, you know something about it; but when you cannot express
it in numbers, your knowledge is of a meager and unsatisfactory kind ...

- Lord Kelvin

6.1 Introduction

This chapter has two objectives. On the one hand we experimentally test our assertion
that Decomposed Function Cardinality (DFC) is a very general measure of
pattern-ness. On the other hand, assuming that DFC is a general measure of pattern-ness,
we begin sizing up the world relative to this new metric.

The first objective is complementary to the Chapter 4 theoretical demonstration
of DFC’s generality. In Chapter 4 we related DFC to the usual measures of
computational complexity: information theoretic program length, algorithmic time complexity
and circuit size complexity.

This chapter reports pattern-ness measurements for many different kinds of things.
These experiments are viewed in a way analogous to the first experiments with other
scientific instruments. For example, the first uses of the mercury expansion
thermometer were, at the same time, assessing the reasonableness of the temperature
readings and quantifying for the first time various important temperatures (e.g.
melting points). The analogy extends to the difference between perceived temperature
and actual temperature. “Temperature” first became a concept based on the general
sensation of warmth and cold. It only became a precise objective physical property
after the invention of a thermometer. We now accept that any differences between
sensed and measured temperatures are due to “thermal illusions” rather than some
failing of thermometers. While we insist that a temperature measuring device reflect
general trends in sensed temperature we expect situations where there are differences.
When we are cold, things seem to be warmer than a thermometer would indicate.
Therefore, a thermometer is not an exact predictor of sensed temperature. However,
we would not accept a thermometer as a measure of temperature if it got too far
from our expectations. We think of DFC as a measure of pattern-ness much like
a  thermometer  is  a  measure  of  temperature.  We  expect  “pattern  illusions”  of  two 
kinds.  First  there  will  be  functions  that  are  patterned  whose  pattern-ness  is  of  a  kind 
that  we  do  not  appreciate.  Thinking  now  in  terms  of  the  pattern-ness  of  an  image 
as  compared  to  the  DFC  of  the  function  that  generates  the  image,  some  patterned 
functions  (e.g.  the  prime  number  acceptor)  will  not  look  patterned  to  us.  A  second 
pattern  illusion  might  result  from  our  tendency  to  impose  a  certain  degree  of  order  on 
things.  For  example,  people  can  see  a  face  in  almost  any  image  with  two  horizontally 
displaced dark spots. As more specific examples, see Figures 6.13 and 6.17; note
that  character  31  of  font  3  looks  patterned  but  does  not  have  low  DFC;  character  48 
of  font  2  with  permuted  variables  does  not  look  patterned  but  has  low  DFC. 

There  are  a  few  complicating  factors  in  these  experiments.  For  one  thing  we 
cannot  be  sure  that  the  AFD  program  has  found  the  true  minimum  cost.  Most  of  the 
runs reported in this chapter were done with version 2a; the balance were done with
version  4a.  As  indicated  in  Chapter  5  we  do  not  think  we  are  missing  the  optimum 
by  very  much,  but  there  is  certainly  some  bias  on  the  measured  DFC  as  compared  to 
the  true  minimum  DFC.  Also,  as  discussed  in  Chapter  4,  DFC  does  not  include  the 
costs  of  interconnections.  Therefore,  things  that  decompose  but  do  so  with  high  DFC 
may  not  be  patterned  at  all.  Finally,  the  AFD  program’s  run-time  is  exponential  in 
the  number  of  input  variables.  Therefore,  we  are  limited  to  measuring  the  DFC  of 
functions  with  no  more  than  about  ten  variables. 


6.2  Randomly  Generated  Functions 

6.2.1  Introduction 

Patterned  functions  are  both  extremely  rare  and  extremely  common.  If  you  look  at  all 
functions  and  choose  a  function  at  random,  assuming  all  functions  are  equally  likely, 
then the function almost certainly will not be patterned. Have you ever noticed
that when you watch a television that is not receiving a signal, realistic images
never seem to appear? If what we are seeing is a truly random image then any
particular  realistic  image  (e.g.  John  Wayne  sitting  on  a  horse)  is  just  as  probable 
as  any  particular  random  looking  image.  You  can  watch  these  random  images  for 
a  long  time  and,  although  some  strange  psychological  phenomena  start  to  happen, 
you  never  see  anything  that  is  remotely  realistic.  Realistic  images  are  patterned  and 
patterns  are  extremely  rare  in  the  space  of  all  functions.  On  the  other  hand  if  you 
pick  realistic  functions,  such  as  functions  with  names  (e.g.  addition,  sine,  palindrome 
acceptor),  real  images,  real  sounds,  etc.  then  the  function  almost  certainly  will  be 
patterned.  That  is,  patterns  are  extremely  common  in  the  real  world.  It  seems  to  us 
that  this  is  a  physical  property  of  the  world,  kind  of  like  mass  or  any  other  physical 
property. 

This section will assess the cost distribution (mean, maximum, minimum, and
occasionally standard deviations) for arbitrary functions, functions with a specific
number of minority elements, and functions with a specific number of cares. This
assessment will be based on theoretical and experimental results.


6.2.2  Completely  Random  Functions 

This  section  is  concerned  with  the  cost  of  completely  random  functions.  That  is,  the 
class  of  functions  to  be  considered  includes  all  functions  and  each  function  has  the 
same  probability  of  being  chosen  as  any  other  function.  Of  course  these  functions  are 
“random” only in the sense that generating a function using a random source of 0’s
and 1’s is one way of generating a sample of this class. All the functions in this class
are  completely  deterministic.  That  is,  they  all  produce  a  specific  output  for  a  specific 
input.  Later  sections  will  limit  class  membership  to  those  functions  with  a  specific 
number  of  minority  elements  or  a  specific  number  of  cares. 
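
As an illustration of this sampling model, the Python sketch below draws one function uniformly from the class of all functions on n binary variables by filling its truth table from a random source of 0's and 1's. It uses Python's standard `random` module; the helper name `random_function` is ours, and this is not the generator used in the experiments.

```python
import random


def random_function(n_vars, rng):
    """Sample a function on n_vars binary variables uniformly at random.

    The function is returned as a truth table: a list of 2**n_vars outputs,
    each an independent fair random bit. Every one of the 2**(2**n_vars)
    possible tables is equally likely.
    """
    return [rng.getrandbits(1) for _ in range(2 ** n_vars)]


rng = random.Random(1991)   # fixed seed so the sample can be reproduced
f = random_function(5, rng)
print(len(f))               # 32 outputs for a five-variable function
print(f[:8])                # the first eight outputs of the sampled table
```

Once sampled, the table is an ordinary deterministic function: looking up the same input always returns the same output.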

We  need  some  initial  results  that  will  allow  us  to  characterize  the  number  of 
functions  with  respect  to  DFC. 

Theorem 6.1 The minimum DFC of a function is either 0, 2, or a sum of powers
of 2 greater than 2.

Proof:

DFC is a sum of powers of 2 by definition. If a minimum representation has cost
greater than 2 and an individual component of cost 2 then we must have a situation
where the component of cost 2 ($p_1$) is connected to another component ($p_2$).
Therefore, $p_2$ could be redefined without increasing $p_2$'s cost such that $p_1$ is not required.
The representation with the redefined $p_2$ would have lower cost than the original
representation, which contradicts the assumption that we began with a minimum cost
representation. Therefore, the assumption that we can have a minimum cost
representation with a component of cost 2 is false.

□ 


Theorem 6.2 An integer n is the sum of powers of 2 greater than 2 if and only if n
is evenly divisible by 4.

Proof:

($\Rightarrow$)

Assume that n is the sum of powers of 2 greater than 2, i.e. $n = \sum_i 2^{p_i}$ where each
$p_i \geq 2$. Thus, $n = 4 \sum_i 2^{p_i - 2}$ where $p_i - 2 \geq 0$. The sum is a whole number since
each of the terms is a whole number. Thus, n is divisible by 4.

($\Leftarrow$)

Assume that n is divisible by 4, i.e. there exists a whole number n' such that n = 4n'.
Thus $n = \sum_{i=1}^{n'} 4$, which is a sum of n' copies of $2^2$.

□ 
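
Theorem 6.2 can also be checked by brute force for small integers. The Python sketch below is a sanity check rather than a proof: it marks every total up to a limit that can be written as a sum of powers of 2 greater than 2 (repeats allowed, since several components may have the same cost) and compares the result with the multiples of 4. The function name and the limit of 200 are arbitrary choices.

```python
def sums_of_powers_above_two(limit):
    """Positive integers <= limit that are sums of powers of 2 greater than 2.

    The allowed terms are 4, 8, 16, ...; a term may be used more than once.
    """
    powers = []
    p = 4
    while p <= limit:
        powers.append(p)
        p *= 2
    reachable = [False] * (limit + 1)
    reachable[0] = True  # the empty sum, used only as a starting point
    for total in range(1, limit + 1):
        # total is reachable if removing one allowed term leaves a reachable sum
        reachable[total] = any(
            total >= p and reachable[total - p] for p in powers
        )
    return {t for t in range(1, limit + 1) if reachable[t]}


limit = 200
multiples_of_four = {t for t in range(1, limit + 1) if t % 4 == 0}
print(sums_of_powers_above_two(limit) == multiples_of_four)  # prints True
```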


Theorem 6.3 All functions must have DFC’s of 0, 2, or multiples of 4.

Proof:

Follows from Theorems 6.1 and 6.2.

□


Theorem 6.4 The maximum number of non-vacuous variables (n') for a function
with DFC $\geq 4$ is $n' = \mathrm{DFC}/4 + 1$.

Proof:

It follows from Theorem 6.5 that 4(n' - 1) is the minimum cost for n' non-vacuous
variables; that is, $n' \leq \mathrm{DFC}/4 + 1$. The equality follows since $\mathrm{DFC}/4 + 1$ is always an
integer by Theorem 6.2.

□ 


Now we consider the number of functions of a given cost:

• cost = 0: There are n projection functions and 2 constant functions for a total
of n + 2 functions of cost 0.

• cost = 2: There are n complements of the projection functions for a total of n
functions of cost 2.

• cost = 4: There can only be two non-vacuous variables by Theorem 6.4. There
are n choose 2 (or n(n - 1)/2) pairs of variables. Let n' = 2. There are $2^{2^{n'}}$
total functions on n' variables. Of these, n' + 2 have cost 0 and n' have cost
2. By Theorem 6.3 above, no functions have cost 1 or 3. Therefore, there are
$2^{2^{n'}} - (n' + 2) - n'$ functions left of cost 4. Since n' = 2,
$2^{2^{n'}} - (n' + 2) - n' = 10$. Therefore, there are 10n(n - 1)/2 functions of cost 4.

• cost = 8: There can only be three non-vacuous variables. There are n choose
3 (or n(n - 1)(n - 2)/6) triples of variables. Let n' = 3. There are $2^{2^{n'}}$ total
functions on n' variables. Of these, n' + 2 have cost 0, n' have cost 2 and
10n'(n' - 1)/2 have cost 4. By Theorem 6.3 above, no functions have
cost 1, 3, 5, 6 or 7. Therefore, there are $2^{2^{n'}} - (n' + 2) - n' - 10n'(n' - 1)/2$
functions left of cost 8. Since n' = 3,
$2^{2^{n'}} - (n' + 2) - n' - 10n'(n' - 1)/2 = 218$.
Therefore, there are 218n(n - 1)(n - 2)/6 functions of cost 8.

• cost = 12: There can only be four non-vacuous variables. There are n choose
4 (or n(n - 1)(n - 2)(n - 3)/24) groups of variables. Let n' = 4. There are
$2^{2^{n'}}$ total functions on n' variables. Of these, n' + 2 have cost 0, n' have cost 2,
10n'(n' - 1)/2 have cost 4 and 218n'(n' - 1)(n' - 2)/6 have cost
8. By Theorem 6.3 above, no functions have cost 1, 3, 5, 6, 7, 9, 10, 11, 13,
14, or 15. We do not know how many functions there are of cost 16, but we
believe there are 58800. This comes from the Pascal Function Decomposition
Experiment where we decomposed virtually all functions on 4 variables. The
number of functions of cost 0, 2, 4 and 8 corresponds exactly to that predicted
above for n = 4, which lends some credence to the 58800 figure. Therefore, there
are $2^{2^{n'}} - (n' + 2) - n' - 10n'(n' - 1)/2 - 218n'(n' - 1)(n' - 2)/6 - 58800$ functions
left of cost 12. Since n' = 4,
$2^{2^{n'}} - (n' + 2) - n' - 10n'(n' - 1)/2 - 218n'(n' - 1)(n' - 2)/6 - 58800 = 5794$.
Therefore, there are 5794n(n - 1)(n - 2)(n - 3)/24 functions of cost 12.

DFC             Number of functions
0               6
2               4
4               60
8               872
12              5,794
16              58,800
Total
Average
Less than 16    10.28%

Table 6.1: Number of Functions for a Given DFC

This  procedure  does  not  work  for  cost  =  16.  We  are  only  assured  of  five  non- 
vacuous  variables  which  allows  for  many  costs  (i.e.  0,  2,  4,  8,  12,  16,  20,  24,  28,  and 
32). 

Virtually all functions on four variables were run on a Pascal function decomposition
program similar to AFD version 2a. The results are shown in Table 6.1, where
all  DFC  values  not  listed  had  zero  functions.  Note  that  the  number  of  functions  with 
costs  zero  through  12  matches  exactly  with  the  theoretical  results.  This  makes  us 
think  that  the  algorithm  finds  optimal  decompositions  on  functions  of  four  variables. 

In summary, there are n + 2 functions of cost 0, n functions of cost 2, 10n(n - 1)/2
functions of cost 4, 218n(n - 1)(n - 2)/6 functions of cost 8, and
5794n(n - 1)(n - 2)(n - 3)/24 functions of cost 12. Figure 6.1 is a graph of these
relationships for various n’s.
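
The counting argument can be replayed mechanically. The Python sketch below recomputes the constants 10, 218 and 5794 by the same subtraction procedure and then evaluates the number of functions of each cost for a given n. The function names are ours, and the figure of 58800 cost-16 functions on four variables is the experimental value reported in the text, taken here as an input rather than derived.

```python
from math import comb

# Experimental count of cost-16 functions on four variables (from the text).
COST16_ON_FOUR_VARS = 58800


def per_group_counts():
    """Functions of cost 4, 8 and 12 on exactly 2, 3 and 4 chosen variables."""
    counts = {}
    # n' = 2: all 16 functions minus those of cost 0 and cost 2.
    counts[4] = 2 ** (2 ** 2) - (2 + 2) - 2
    # n' = 3: also remove the cost-4 functions on each pair of variables.
    counts[8] = 2 ** (2 ** 3) - (3 + 2) - 3 - counts[4] * comb(3, 2)
    # n' = 4: also remove cost-8 functions and the assumed cost-16 functions.
    counts[12] = (2 ** (2 ** 4) - (4 + 2) - 4
                  - counts[4] * comb(4, 2)
                  - counts[8] * comb(4, 3)
                  - COST16_ON_FOUR_VARS)
    return counts


def count_by_cost(n):
    """Number of functions on n variables with cost 0, 2, 4, 8 and 12."""
    g = per_group_counts()
    return {
        0: n + 2,
        2: n,
        4: g[4] * comb(n, 2),
        8: g[8] * comb(n, 3),
        12: g[12] * comb(n, 4),
    }


print(per_group_counts())
print(count_by_cost(4))
```

For n = 4 the result reproduces the cost 0 through 12 rows of Table 6.1; for larger n the same call gives the curves that Figure 6.1 plots.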

Figure 6.1: Number of Functions versus DFC for n up to 24

We used the AFD program to extend the theoretical results on DFC histograms.
We generated 745 random functions on five variables and decomposed them with
versions 1, 4a, 2a and 3. Because of limited computer resources version 3 was run
only on the first 92. The combined results were: 711 functions had cost 32, 22 had
cost 28 and 12 had cost 24. Based on this, we would estimate that zero percent of
the functions on five variables have cost 0 through 20, 12/745 x 100 = 1.6 percent
have cost 24, 22/745 x 100 = 3.0 percent have cost 28, and 711/745 x 100 = 95.4
percent have cost 32. However, we must also consider that better decompositions
may exist for many of the functions than we were able to find. Version 3 found no
new decompositions beyond those found with versions 1 or 4a but did reduce the
cost of one decomposition from 28 to 24. Based on this fact, we are suspicious that
our estimated number of functions is inflated for a cost of 28. Figure 6.2 plots the
estimated frequencies for costs of 24, 28 and 32 along with the theoretical frequencies
developed earlier.

We generated 400 random functions on six variables; version 4a found no
decompositions and version 1 found no decompositions in the first 100 of these. Therefore,
we can be 98 percent confident that the fraction of “decomposable” functions on six
variables is less than 1 percent. By “decomposable,” we mean that the function has
DFC < $2^n$.

We  generated  50  random  functions  on  each  of  7,  8,  9,  and  10  variables,  none 
of  which  decomposed.  Version  4a  was  used  for  n  =  7,8,  and  9  and  version  2a  for 
n  =  10.  Therefore,  we  can  be  92  percent  confident  that  the  fraction  of  decomposable 
functions  for  each  of  these  number  of  variables  is  less  than  five  percent. 
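
The report does not state how these confidence levels were obtained. One standard calculation that reproduces both figures is sketched below in Python, under the assumption that the sampled functions are independent: if the true fraction of decomposable functions were at least p, the chance of seeing none in N samples would be at most (1 - p)^N, so observing none gives confidence 1 - (1 - p)^N that the fraction is below p. The function name is ours.

```python
def confidence_fraction_below(p, trials):
    """Confidence that the true fraction is below p after `trials`
    independent samples in which no decomposable function was found."""
    return 1.0 - (1.0 - p) ** trials


# 400 six-variable functions, none decomposed, bound of 1 percent.
print(round(confidence_fraction_below(0.01, 400), 3))
# 50 functions for each of n = 7 to 10, none decomposed, bound of 5 percent.
print(round(confidence_fraction_below(0.05, 50), 3))
```

The two values come out at roughly 0.98 and 0.92, in line with the 98 percent and 92 percent quoted in the text.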

The number of functions with $L(e(r_f)) < l$ is less than or equal to $2^l$ by the
information theoretic constraints (see Theorem A.17). This upper bound (i.e. $2^l$)
is fairly consistent with the trends in the experimental data and is perhaps a good
estimate for the number of functions of cost less than or equal to $l$.

Functions on a very large number of variables (n > 16) will always decompose,
at least a little bit. This result is from Lupanov'58 [34]; see also Savage'76 [54,
pp. 116-120]. The Lupanov representation breaks a function into components. It is
possible to realize these components with a cost savings if, instead of realizing the
components independently, we first compute all possible minority elements and then
use each minority element many times in computing the Lupanov components.


Figure 6.2: Number of Functions versus DFC for n = 5
(Plot of number of functions against P-cost, with one curve per value of n.)

The Lupanov representation has a complexity that grows as 2^n/n. Based on rough
calculations, the constants involved cause the cost of the Lupanov representation to
be greater than 2^n for n < 16. Therefore, for the size of functions that we have
experimented with (i.e. n ≤ 10), this is not a factor. In terms of the general theory,
though, it is important. As discussed in Chapter 3, PT 1 is concerned with realizing a
single function. We noted that when realizing multiple functions there would be some
economy in re-using certain computations. However, for PT 1, we did not want to
deal with this complication. Therefore, we defined the PT 1 problem to only involve
a single function. The Lupanov upper bound makes it clear that even when realizing
a single function, the fact that you generate multiple intermediate functions makes
re-use a factor. It's not a terribly big factor (that is, 2^n versus 2^n/n), but a factor
nonetheless. Of course the Information Theoretic constraints (Appendix A) still apply;
therefore, if the cost measure included the cost of the interconnections then it would
not be a factor at all. In a sense, the Lupanov representation gets the DFC below 2^n
by making the interconnection complexity very large.

In  summary,  for  the  vast  majority  of  functions  on  five  to  ten  variables,  DFC 
is  the  same  as  the  function’s  cardinality.  Therefore,  we  expect  randomly  generated 
functions to have DFC = 2^n and when the DFC is less than 2^n then there is something
special  about  that  function.  Of  course  we  think  that  “something  special”  is  that  it  is 
patterned.  In  recognition  of  the  Lupanov  upper  bound,  this  is  not  true  for  functions 
with  more  than  16  variables.  When  dealing  with  large  functions,  it  may  be  desirable 
to  reconsider  our  metric  and  possibly  include  interconnection  costs. 



6.2.3 Functions with a Specific Number of Minority Elements

Some  functions  have  a  definite  minority  in  terms  of  output  type.  That  is,  a  function 
may have 0 as its output for all but a small number of inputs. In this case we would
say  that  1  is  the  minority  output  type.  An  element  of  a  function  with  output  of  the 
minority  type  is  called  a  “minority  element.”  This  section  will  investigate  the  cost 
of  functions  with  respect  to  their  number  of  minority  elements.  Functions  with  a 
relatively  small  fraction  of  minority  elements  are  common  in  practice.  For  example, 
the  prime  number  acceptor  only  outputs  a  1  for  the  relatively  few  numbers  that  are 
prime  and  a  target  detection  algorithm  would  only  output  a  1  on  the  extremely  small 
fraction  of  possible  images  that  include  targets. 

There is a minimum cost for a function with a given number of non-vacuous
variables. In particular, if a function has n non-vacuous variables then the cost of the
function cannot be less than 4(n - 1). Therefore, if a function on n variables has i
vacuous variables the cost of this function is at least 4(n - i - 1).

Theorem 6.5 If F is the set of binary functions with n non-vacuous variables then
the greatest lower bound on the cost of f ∈ F is 4(n - 1).

Proof: 

There exists a function f in F with n_i = 2 for i = 1, ..., P, where n_i is the number
of input variables for component p_i of the representation of f and P is the total
number of components in the representation. We are assured of f's existence since
for any representation with a p_i with n_i > 3 we could partition the variables of p_i
into groups of size n_i - 1 and 1 with cost 2^(n_i - 1) + 2^2, which is less than 2^(n_i)
for n_i > 3. The resulting function would be an element of F since the composition of
functions which are essentially dependent on all their inputs is a function that is
essentially dependent on all of its inputs. We can solve for P for such a representation
since the total number of variables input to the p_i's is the original n input variables
plus the P variables generated by some p_i minus the final output variable, which is
not an input. That is,

    \sum_{i=1}^{P} n_i = n + P - 1,

which reduces to P = n - 1 since all n_i = 2. The cost of this representation is:

    \sum_{i=1}^{P} 2^{n_i} = \sum_{i=1}^{n-1} 2^2 = 4(n - 1).

This is a greatest lower bound since there exist functions with this cost, e.g.
x_1 + x_2 + ... + x_n, where + is an OR operation and the x_i's are Boolean variables.

□


There is a relationship between the number of minority elements in a function
and the possible existence of vacuous variables. Intuitively, if a function has an odd
number of minority elements then it is not possible for a partition matrix to have two
identical sets of columns, and two identical sets of columns are possible for a function
with a vacuous variable. We state this as a theorem.

Theorem 6.6 Suppose F is the set of functions on n variables with exactly k minority
elements, where k is an integer greater than zero. There exists f ∈ F such that f
has i vacuous variables if and only if k has 2^i as a factor.

Proof: 

==>

f has k minority elements and i vacuous variables. If there are no vacuous variables
then 2^0 is obviously a factor. Otherwise, choose a vacuous variable. The partition
matrix with respect to this variable has two identical columns. Of course the columns
contain the same number of minority elements and therefore the total function has an
even number of minority elements. If we now drop this vacuous variable and repeat
the argument for the remaining function we see that there is a factor of 2 in k for
each vacuous variable.

<==

Let k = k' 2^i. We can construct an f with i vacuous variables one vacuous variable at
a time. For the first vacuous variable, let f(..., 0, ...) = f(..., 1, ...) but leave the
specific values undefined. Now consider f(..., 0, ...) as the “function” and repeat the
procedure i times. This is possible since there are a multiple of 2^i minority elements.
The final “function” can be defined arbitrarily as long as it has k' minority elements.
A function so constructed has i vacuous variables and k' 2^i minority elements.

□


This  result  could  be  used  as  a  search  constraint  in  the  decomposition  process.  For 
example,  if  there  are  an  odd  number  of  minority  elements  then  there  is  no  point  in 
testing  for  vacuous  variables. 
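
The search constraint above can be illustrated on a small truth table. The following is a minimal sketch, not the AFD program's test: it finds vacuous variables by brute force and checks that the minority count is divisible by 2 raised to the number of vacuous variables, as the forward direction of Theorem 6.6 requires. The helper names and the example function are hypothetical.

```python
from itertools import product


def vacuous_variables(f, n):
    """Indices of the variables that never change the output of f on {0,1}^n."""
    vacuous = []
    for i in range(n):
        if all(f(x[:i] + (0,) + x[i + 1:]) == f(x[:i] + (1,) + x[i + 1:])
               for x in product((0, 1), repeat=n)):
            vacuous.append(i)
    return vacuous


def minority_count(f, n):
    """Number of inputs whose output is of the minority type."""
    ones = sum(f(x) for x in product((0, 1), repeat=n))
    return min(ones, 2 ** n - ones)


def example(x):
    # Hypothetical function on four variables that ignores x[2].
    return x[0] & x[1] & x[3]


k = minority_count(example, 4)
vac = vacuous_variables(example, 4)
print(k, vac, k % (2 ** len(vac)) == 0)
```

When k is odd, the divisibility condition can only hold with zero vacuous variables, so the brute-force scan can be skipped entirely.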

We  now  have  a  connection  between  the  number  of  minority  elements  and  the 
number  of  vacuous  variables  as  well  as  between  the  number  of  vacuous  variables  and 
the  minimum  cost. 

Theorem 6.7 Let F be the set of binary functions on n variables with exactly k
minority elements. Let i' be the largest i such that 2^i is a factor of k. Then for all
f ∈ F the cost of f is ≥ 4(n - i' - 1) and there exists an f ∈ F such that the cost of
f equals 4(n - i' - 1). That is, the cost of functions in F has a greatest lower bound
of 4(n - i' - 1).

Proof: 

This  follows  from  Theorem  6.5  and  Theorem  6.6. 

□ 


Figure 6.3: DFC With Respect to Number of Minority Elements, n=4
(Plot of DFC against number of minority elements: experimental average, minimum
and maximum, with the upper and lower bounds.)

This  lower  bound  on  cost  could  be  used  as  a  stopping  condition  in  a  search  for 
a  best  decomposition.  That  is,  if  the  cost  of  the  best  decomposition  so  far  is  at  the 
lower  bound  then  the  search  can  be  stopped. 

We  have  an  upper  bound  on  the  cost  of  a  function  in  terms  of  the  number  of 
minority  elements.  This  upper  bound  is  based  on  the  fact  that  individual  minority 
elements  can  be  generated  with  a  product  and  then  summed  together. 

Theorem 6.8 If f is a function on n variables with k minority elements then the
cost of f is less than or equal to 4(nk - 1).
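
The bounds of Theorems 6.7 and 6.8 are simple enough to compute directly. The sketch below uses hypothetical function names and assumes k > 0, as in Theorem 6.6. It also checks the sum-of-products accounting behind the upper bound, under the assumption that each 2-input component costs 4: k products of n variables cost 4(n - 1) each, and combining k products costs 4(k - 1).

```python
def two_adic_valuation(k: int) -> int:
    """Largest i such that 2^i divides k (requires k > 0)."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    i = 0
    while k % 2 == 0:
        k //= 2
        i += 1
    return i


def dfc_lower_bound(n: int, k: int) -> int:
    """Greatest lower bound of Theorem 6.7: 4(n - i' - 1)."""
    return 4 * (n - two_adic_valuation(k) - 1)


def dfc_upper_bound(n: int, k: int) -> int:
    """Upper bound of Theorem 6.8: 4(nk - 1)."""
    return 4 * (n * k - 1)


def sum_of_products_cost(n: int, k: int) -> int:
    """k products of n variables, then a sum of the k products."""
    return k * 4 * (n - 1) + 4 * (k - 1)


print(dfc_lower_bound(4, 3), dfc_upper_bound(4, 3))
```

For large k the upper bound exceeds 2^n and so says nothing beyond the trivial bound; it is informative only when minority elements are scarce.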

To expand upon the theoretical bounds we generated a series of random functions
with a controlled number of minority elements and determined their cost with the
AFD program. We generated five functions each for 4, 6, 7, 8 and 9* input variables
for each of the six different fractions of minority elements. On functions of five
variables, we generated an average of about 500 functions for each possible number
of minority elements (i.e. 0 - 15). The results of these experiments are plotted in
Figures 6.3 through 6.7.

Table 6.2 lists the fraction of minority elements at which a given fraction of cost
is reached. Based on this we might expect the average cost of functions with less than
10 percent minority elements to be less than 0.9[f] and the average cost of functions
with less than 5 percent minority elements to be less than 0.5[f].

*Only four different nine variable functions were generated for each number of minority elements.

Figure 6.4: DFC With Respect to Number of Minority Elements, n=5
(Plot for 5 variables: average, minimum and maximum.)

Figure 6.5: DFC With Respect to Number of Minority Elements, n=6
(Plot of experimental average, minimum and maximum, with the upper and lower bounds.)

n    minority element fraction at 0.5[f]    minority element fraction at 0.9[f]
4    .06                                    .25
5    .03                                    .16
6    .05                                    .19
7    .06                                    .13
8    .07                                    .16

Table 6.2: Number of Minority Elements Required for a Given Cost

Figure 6.6: DFC With Respect to Number of Minority Elements, n=7
(Plot of DFC against number of minority elements: experimental average, minimum
and maximum, with the upper and lower bounds.)

Figure 6.7: DFC With Respect to Number of Minority Elements, n=8
(Plot of DFC against number of minority elements: experimental average, minimum
and maximum, with the upper and lower bounds.)

6.2.4 Functions with a Specific Number of Don't Cares

A random total function was generated for each n from 4 to 10. These functions
were then “sampled” and decomposed. By sampling we mean that we took a subset
of the total function and then decomposed this partial function. The subsets were
chosen randomly. In circuit design, the points not in a partial function are often
referred to as “Don't Cares,” meaning that the output can be anything. Therefore,
the size of the partial function is also called the number of “cares.” The sizes of
the subsets were 2, 5, 7, 10, 12 and 15 for n = 4. For all other n the sample
sets contained 5 x 2^(n-5) x i elements, where i = 1, 2, ..., 6. This corresponds to the
following percentages: 15.6, 31.3, 46.9, 62.5, 78.1 and 93.8. Figure 6.8 plots the results
of the AFD version 2a runs.

Figure 6.8: DFC as a Function of the Number of Cares
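
The sample-set sizes follow directly from the formula in the text. A short sketch (function names are hypothetical) that lists the sizes and the corresponding fraction of cares for any n:

```python
def sample_sizes(n: int) -> list[int]:
    """Number of cares in each of the six sample sets for an n-variable function."""
    if n == 4:
        # The text gives these sizes explicitly for n = 4.
        return [2, 5, 7, 10, 12, 15]
    return [5 * 2 ** (n - 5) * i for i in range(1, 7)]


def care_percentages(n: int) -> list[float]:
    """Sample sizes expressed as a percentage of the 2^n points of the function."""
    return [100.0 * size / 2 ** n for size in sample_sizes(n)]


print(sample_sizes(7))
print(care_percentages(7))
```

For n ≥ 5 the percentages are 100 x 5i/32 regardless of n, which is why a single list of six percentages covers every function size above four variables.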

In  order  to  assess  the  sensitivity  of  these  results  to  the  sample  set,  we  repeated 
the  above  experiment  five  times  for  a  random  function  on  seven  variables.  That  is, 
five  different  sample  sets  of  each  size  were  taken,  but  always  from  the  same  function. 
These  results  are  shown  in  Figure  6.9.  Note  that  DPFC  can  never  exceed  the  number 
of cares. The DFC of a random function appears to increase linearly with the number
of cares. The DFC of the partial functions reaches the DFC of the total function when
the number of cares gets up to between 60 and 80 percent.


6.3  Non-randomly  Generated  Functions 

In  this  section  we  go  around  with  our  new  instrument  and  measure  the  pattern-ness 
of  many  different  kinds  of  functions.  We  are  testing  the  generality  of  our  measure 
and  exploring  pattern-ness. 

We  identified  several  classes  of  functions  that  are  small  enough  to  be  tested  with 
the  AFD  program.  These  classes  include  numerical  functions  and  sequences,  symbolic 
functions,  string  manipulation  functions,  a  graph  theoretic  function,  images,  and  files. 
Note that although we only report the DFC resulting from the AFD runs, whenever
the DFC is low the AFD program is also yielding an algorithm.

Figure 6.9: DFC as a Function of the Number of Cares, n = 7
(Plot against number of samples; legend entries include average, minimum and
maximum error and minimum and maximum cost.)

6.3.1  Numerical  Functions  and  Sequences 

By “numerical functions” we mean the usual arithmetic operations (addition,
subtraction, multiplication and division), square and cube roots, trigonometric and
logarithmic functions, prime and Fibonacci number acceptors, the greatest common
divisor and the matrix determinant. The computation of numerical functions has a long
history. The algorithms used to compute these functions have been hand-crafted over
the centuries. To underscore the significance of what AFD does on these problems,
try to imagine that we did not know how to compute addition and we were given
the task of designing an addition algorithm. We might get out our algorithm design
text, such as [3]; but what technique would we use? Are we going to try to represent
the problem with a graph? Can we apply Heap Sort? What about dynamic
programming? Where in this admittedly excellent text on algorithm design is there
a method that would result in a good algorithm for addition? Of course the point
is, algorithm design texts do not give methods for algorithm design; they give a tool
kit of hand-made algorithms to choose from. Another point is that there are many
problems, including some as simple as addition, that are not naturally constructed
from things in the tool kit.

In  this  section  we  apply  PT  to  a  table  defining  addition  and  it  produces  the 
familiar  digit-wise  add  and  carry  algorithm  that  we  all  learned  in  grade  school.  We 
believe that it is very important that this algorithm was produced by a computer,
automatically,  and  the  same  computer  program  found  the  patterns  in  many  other 
kinds  of  functions. 

Numerical functions are represented as binary functions by using the usual binary
equivalent of each input, padded on the left with zeros to give the input the appropriate
length. When there are two input numbers, as in addition, the binary function
representation has as input the concatenation of the binary equivalents of the two
input numbers. When the output is a non-binary number, we represent the numerical
function by a binary function for each bit in the binary representation of the output.
Consider addition as an example. Define the numerical function addition to be the
sum of two numbers between 0 and 15. The output will be between 0 and 30. Each
input number must be represented by four bits. The output must be represented
by five bits. Therefore, we represent addition as five binary functions, each on eight
variables. Although these 5 functions are decomposed independently, we sometimes
refer to them as a single function. Numerical sequences are represented as “acceptors.”
For example, the Fibonacci sequence is represented as a numerical function whose
output is 0 if the input is not in the Fibonacci sequence and whose output is 1 if
the input is in the sequence. Tables 6.3 through 6.12 show the results for numerical
functions.

Output Bit    n    2^n     DFC    % DFC
1             8    256       4      1.6
2             8    256      12      4.7
3             8    256      20      7.8
4             8    256      28     10.9
5             8    256      28     10.9
Total              1280     92

Table 6.3: Addition.
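
The representation described above can be sketched as follows. This is an illustration only, not the AFD input format: the helper names are hypothetical, and the choice to list input bits most significant first is an assumption.

```python
from itertools import product


def to_bits(value, width):
    """Binary equivalent of value, zero-padded on the left (MSB first)."""
    return tuple((value >> (width - 1 - j)) & 1 for j in range(width))


def bit_functions(func, in_widths, out_width):
    """One truth table per output bit; tables[0] is the least significant bit.
    Each table maps the concatenated input bits to 0 or 1."""
    tables = [{} for _ in range(out_width)]
    for args in product(*(range(2 ** w) for w in in_widths)):
        key = ()
        for a, w in zip(args, in_widths):
            key += to_bits(a, w)
        out = func(*args)
        for b in range(out_width):
            tables[b][key] = (out >> b) & 1
    return tables


# Addition of two numbers between 0 and 15: five binary functions on eight variables.
add_tables = bit_functions(lambda a, b: a + b, (4, 4), 5)
print(len(add_tables), len(add_tables[0]))
```

Each of the five tables has 2^8 = 256 entries, which is the 2^n column of Table 6.3.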

Addition  was  represented  by  a  binary  function  of  the  form 

    f : \{0,1\}^4 \times \{0,1\}^4 \to \{0,1\}^5

Output  bit  1,  the  least  significant,  is  simply  an  exclusive  OR  of  the  least  significant 
bits  of  the  two  inputs.  The  decomposition  found  for  the  more  significant  output  bits 
is  the  familiar  binary  adder  in  combinational  form.  Notice  that  the  more  significant 
bits  re-compute  the  less  significant  bits,  so  addition  can  be  realized  with  a  cost  of 
20  rather  than  92.  We  have  deliberately  not  tried  to  exploit  this  kind  of  savings 
(see  Section  3.4.9).  However,  this  suggests  an  approach  of  first  decomposing  one  of 
multiple  functions  and  then  treating  its  output  and  intermediate  states  as  inputs  to  a 
second function. In general, the DFC of adding two m-bit numbers is 16m - 8. The
cost of only the most significant bit is 8m - 4.
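
Two of the statements above can be checked mechanically. The sketch below verifies that the least significant sum bit is the exclusive OR of the input least significant bits, and compares the DFC column of Table 6.3 with a per-bit cost pattern. The 8m - 4 cost of the most significant bit is from the text; the 8j - 4 pattern for the lower bits is only an observation read off the table, not a formula stated in the report.

```python
from itertools import product


def lsb_is_xor(width: int) -> bool:
    """Check that bit 1 of a + b equals the XOR of the least significant input bits."""
    return all((a + b) & 1 == (a & 1) ^ (b & 1)
               for a, b in product(range(2 ** width), repeat=2))


def independent_bit_cost(j: int, m: int) -> int:
    """Assumed pattern: 8j - 4 for sum bits j = 1..m, 8m - 4 for the carry bit m + 1."""
    return 8 * min(j, m) - 4


table_6_3_dfc = [4, 12, 20, 28, 28]  # output bits 1..5 as reported in Table 6.3
m = 4
costs = [independent_bit_cost(j, m) for j in range(1, m + 2)]
print(lsb_is_xor(m), costs, sum(costs))
```

The total of the independently decomposed bits matches the 92 reported in the table; the saving from sharing intermediate results between output bits is not modeled here.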

Subtraction  was  represented  by  a  binary  function  of  the  form 

    f : \{0,1\}^4 \times \{0,1\}^4 \to \{0,1\}^5

Output  bit  5  is  the  sign  bit  and  output  bit  1  is  the  least  significant  of  the  4  bits  in 
the  binary  output. 



Table  6.4:  Subtraction. 


Table  6.5:  Multiplication. 



Output Bit    n    2^n     DFC    % DFC
1             8    256     168     65.6
2             8    256      72     28.1
3             8    256      24      9.4
4             8    256      16      6.2
Total              1024    280     27.3

Table 6.6: Modulus.


Output Bit    n    2^n     DFC    % DFC
1             8    256     124     48.4
2             8    256     134     52.3
3             8    256      92     35.9
4             8    256      28     10.9
Total              1024    378     36.9

Table 6.7: Remainder.

Multiplication  was  represented  by  binary  functions  of  the  form 

    f : \{0,1\}^{n/2} \times \{0,1\}^{n/2} \to \{0,1\}^n.

We  evaluated  functions  with  n  equal  to  4,  6,  and  8.  The  output  bits  are  listed  in 
order  of  increasing  significance.  Note  that  for  n  =  8,  output  bits  5  and  6  did  not 
decompose.  One  possible  cause  for  this  is  that  you  need  the  results  of  computing  the 
less  significant  bits  before  these  bits  are  patterned.  Another  explanation  might  be  that 
multiplication  has  a  significant  “buy-in”  cost.  That  is,  some  patterned  functions  do 
not  decompose  until  the  number  of  variables  gets  above  some  threshold.  The  six-bit 
multiplication  had  a  lower  percentage  DFC  than  the  eight-bit.  This  is  in  part  due  to 
our  using  the  better  version  4a  on  six-bit  multiplication.  When  six-bit  multiplication 
is  run  on  version  2a  (which  is  what  we  had  to  use  on  eight-bit  multiplication),  the 
total  DFC  is  46.9  percent. 

Modulus was represented by binary functions of the form

    f : \{0,1\}^4 \times \{0,1\}^4 \to \{0,1\}^4

where the output is the integer part of x_1 divided by x_2.

Remainder was represented by binary functions of the form

    f : \{0,1\}^4 \times \{0,1\}^4 \to \{0,1\}^4

where the output is the integer part of the remainder from x_1 divided by x_2.



Output Bit    n    2^n     DFC    % DFC
1             8    256      80     31.2
2             8    256      36     14.1
3             8    256      12      4.7
4             8    256       4      1.6
Total              1024    132     12.9

Table 6.8: Square Root.


Output Bit    n    2^n     DFC    % DFC
1             9    512     136     26.6
2             9    512      24      4.7
3             9    512       8      1.6
Total              1536    168     10.9

Table 6.9: Cube Root.

Square root was represented by a binary function of the form

    f : \{0,1\}^8 \to \{0,1\}^4.

The output is the binary equivalent of the integer part of the square root of the input.
The cube root was represented by a binary function of the form

    f : \{0,1\}^9 \to \{0,1\}^3.

The output is the binary equivalent of the integer part of the cube root of the input.

The first quadrant of the sine function was represented by a binary function of
the form

    f : \{0,1\}^8 \to \{0,1\}^8.

An input of x degrees was encoded as the binary equivalent of the integer part of
255/90 times x. The input varied from 0 to 90 degrees. The output is the binary
equivalent of the integer part of 255 sin x. The output bits are listed in order of
increasing significance.

The logarithm function was represented by a binary function of the form

    f : \{0,1\}^8 \to \{0,1\}^3.

The output is the integer part of the logarithm to the base 2 of the input. The output
bits are listed in order of increasing significance.
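
The integer-part functions described above (square root, cube root and base-2 logarithm) can be sketched as below, together with a check that their outputs fit in the stated number of output bits. The function names are hypothetical, and the treatment of input 0 for the logarithm is an assumption, since the text does not say how that input was handled.

```python
import math


def int_sqrt(x: int) -> int:
    """Integer part of the square root of x."""
    return math.isqrt(x)


def int_cbrt(x: int) -> int:
    """Integer part of the cube root of x, corrected for floating-point rounding."""
    r = round(x ** (1.0 / 3.0))
    while r ** 3 > x:
        r -= 1
    while (r + 1) ** 3 <= x:
        r += 1
    return r


def int_log2(x: int) -> int:
    """Integer part of log base 2 of x; input 0 is mapped to 0 (an assumption)."""
    return x.bit_length() - 1 if x > 0 else 0


def output_bits_needed(func, n: int) -> int:
    """Number of bits needed for the largest output over all n-bit inputs."""
    return max(func(x) for x in range(2 ** n)).bit_length()


print(output_bits_needed(int_sqrt, 8),
      output_bits_needed(int_cbrt, 9),
      output_bits_needed(int_log2, 8))
```

The resulting widths of 4, 3 and 3 bits agree with the number of output-bit rows in Tables 6.8, 6.9 and 6.11.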

The “greater than” function is of the form

    f : \{0,1\}^4 \times \{0,1\}^4 \to \{0,1\},

where the output is 1 if the first 4-bit input is greater than the second 4-bit input
and 0 otherwise.

Output Bit    n    2^n     DFC    % DFC
1             8    256     256    100.0
2             8    256     256    100.0
3             8    256     256    100.0
4             8    256     224     87.5
5             8    256     176     68.8
6             8    256     104     40.6
7             8    256      54     21.1
8             8    256      24      9.4
Total              2048   1350     65.9

Table 6.10: Sine.

Output Bit    n    2^n     DFC    % DFC
1             8    256      24      9.4
2             8    256      20      7.8
3             8    256      12      4.7
Total              768      56      7.3

Table 6.11: Logarithm.

Function        n    2^n     DFC    % DFC
Greater Than    8    256      28     10.9
Factorial       5    32       28     87.5

Table 6.12: Miscellaneous Numerical Functions.

n     2^n     DFC    % DFC
6     64       64    100.0
7     128     104     81.2
8     256     196     76.6
9     512     336     65.6
10    1024    600     58.6

Table 6.13: Primality Tests.

n     2^n     DFC    % DFC
5     32       24     75.0
6     64       48     75.0
7     128      76     59.4
8     256     108     42.2
9     512     144     28.1

Table 6.14: Fibonacci Numbers.

The “factorial” function is of the form

    f : \{0,1\}^5 \to \{0,1\},

where the output is 1 if and only if the input is some factorial (i.e. 1, 2, 6 or 24).
The prime number acceptors are functions of the form

    f : \{0,1\}^n \to \{0,1\},

where the output is 1 if the input is a prime number and 0 otherwise (zero and one
were considered prime). Runs were made with n ranging from 6 to 10. This is an
example where n must be large before the function begins to decompose.

The Fibonacci number acceptors are functions of the form

    f : \{0,1\}^n \to \{0,1\},

where the output is 1 if the input is a Fibonacci number and 0 otherwise. Runs were
made with n ranging from 5 to 9.
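
The acceptors described above are easy to tabulate. The sketch below uses hypothetical function names. It follows the report's convention that zero and one count as prime; treating 0 as a Fibonacci number is an assumption, since the text does not say.

```python
import math


def is_prime_report_convention(x: int) -> bool:
    """Primality test in which 0 and 1 are counted as prime, as in the report."""
    if x < 2:
        return True
    return all(x % d for d in range(2, math.isqrt(x) + 1))


def fibonacci_numbers_below(limit: int) -> set[int]:
    """Fibonacci numbers smaller than limit, starting from 0 (an assumption)."""
    fibs, a, b = set(), 0, 1
    while a < limit:
        fibs.add(a)
        a, b = b, a + b
    return fibs


def acceptor_table(predicate, n: int) -> list[int]:
    """Truth table of an acceptor on n-bit inputs, indexed by the input value."""
    return [int(predicate(x)) for x in range(2 ** n)]


prime_table = acceptor_table(is_prime_report_convention, 6)
fibs = fibonacci_numbers_below(2 ** 5)
fib_table = acceptor_table(lambda x: x in fibs, 5)
print(sum(prime_table), sum(fib_table))
```

Both tables have far fewer ones than zeros, so these acceptors are examples of the functions with few minority elements discussed in Section 6.2.3.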

One experiment concerned the patterns in deciding whether or not a binomial
coefficient is an odd number. About 100 years ago, E. Lucas discovered that a choose
b is odd if and only if every bit in b implies its corresponding bit in a. This is a
highly patterned computation according to Pattern Theory and the AFD program
rediscovered this pattern. Tables 6.15 and 6.16 summarize these results. The binomial
coefficient functions are of the form

    f : \{0,1\}^m \times \{0,1\}^m \to \{0,1\},

where m is 3 or 4 and

    f(a,b) = \begin{cases}
        1 & \text{if $a$ choose $b$ is odd} \\
        0 & \text{if $a$ choose $b$ is even} \\
        \text{don't care} & \text{if $a$ choose $b$ is undefined (i.e. $b > a$)}
    \end{cases}

The Lucas functions are of the form

    g(a,b) = \begin{cases}
        1 & \text{if each bit in $b$ implies its corresponding bit in $a$} \\
        0 & \text{otherwise}
    \end{cases}

Lucas's Theorem [12, p.2] says that g(a, b) = f(a, b) when f is defined. Note that
g has by definition a decomposition of minimum cost for no vacuous variables (i.e.
DFC(g) = 4(n - 1)).

Function          n    2^n     DFC    % DFC
Lucas Function    6    64       20
Lucas Function    8    256      28

Table 6.15: DFC of Lucas Functions.

Function           n    2^n     DFC    % DFC
Binomial Coeff.    6    64       12      4.7
Binomial Coeff.    8    256      20      7.8

Table 6.16: DFC of Binomial Coefficient Based Functions.
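
Lucas's result as stated above can be checked exhaustively for the sizes used in the experiment. The sketch below uses hypothetical function names and represents a don't care by None. It compares the binomial-coefficient parity function f with the bitwise-implication function g wherever f is defined.

```python
from math import comb


def lucas_g(a: int, b: int) -> int:
    """1 if every 1 bit of b is also a 1 bit of a (each bit of b implies a's bit)."""
    return int((b & ~a) == 0)


def binomial_parity_f(a: int, b: int):
    """1 if a choose b is odd, 0 if even, None (don't care) when b > a."""
    if b > a:
        return None
    return comb(a, b) % 2


def agree_where_defined(m: int) -> bool:
    """Compare f and g on all pairs of m-bit inputs for which f is defined."""
    return all(binomial_parity_f(a, b) == lucas_g(a, b)
               for a in range(2 ** m) for b in range(2 ** m) if b <= a)


print(agree_where_defined(3), agree_where_defined(4))
```

Since g is an AND over the m bit positions of a 2-input implication, it can be built from 2-input components only, which is consistent with the minimum cost 4(n - 1) noted above.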

The  Greatest  Common  Divisor  (GCD)  problem  is  represented  by  a  function  of  the 
form 

    f : \{0,1\}^4 \times \{0,1\}^4 \to \{0,1\}^4.

Since  we  are  not  interested  in  the  GCD  when  either  number  is  zero,  0000  represents 
decimal  1,  0001  represents  decimal  2,  . . . ,  1111  represents  decimal  16.  Output  bit  1 
is  the  most  significant. 

The  determinant  function  is  represented  by  a  function  of  the  form 

    f : \{0,1\}^9 \to \{0,1\}^3.


Output Bit    n    2^n     DFC    % DFC
1             8    256      48     18.8
2             8    256     172     67.2
3             8    256     164     64.1
4             8    256       4      1.6
Total              1024    388     37.9

Table 6.17: DFC of Greatest Common Divisor Function.


Output Bit           n    2^n     DFC    % DFC
zero/non-zero        9    512     128     25.0
positive/negative    9    512     108     21.1
±1 or ±2             9    512      34      6.6
Total                     1536    270     17.6

Table 6.18: DFC of the Determinant Function.

The input is a 3 x 3 binary matrix. The first output bit indicates whether or not
the determinant is zero. The second output bit indicates whether a non-zero
determinant is positive or negative. The third output bit indicates whether
the absolute value of a non-zero determinant is 1 or 2. Note that the size of the
determinant function is log_2 5^512 = 1188.83 since there are only five possible outputs
(i.e. -2, -1, 0, 1, 2).
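
The claim that only five determinant values occur, and the resulting size figure, can be confirmed by enumerating all 512 binary matrices. This is a minimal sketch with hypothetical function names; it does not reproduce the three-bit output encoding used for the AFD runs.

```python
from itertools import product
from math import log2


def det3(m):
    """Determinant of a 3 x 3 matrix given as three rows of three entries."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def determinant_counts():
    """Map each determinant value to the number of binary matrices producing it."""
    counts = {}
    for bits in product((0, 1), repeat=9):
        value = det3((bits[0:3], bits[3:6], bits[6:9]))
        counts[value] = counts.get(value, 0) + 1
    return counts


counts = determinant_counts()
size_in_bits = 512 * log2(len(counts))  # log_2 of 5^512
print(sorted(counts), round(size_in_bits, 2))
```

The enumeration yields exactly the values -2 through 2, and 512 x log_2 5 rounds to the 1188.83 quoted in the text.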

In  summary,  about  76  numeric  functions  were  run  on  the  AFD  program.  All 
decomposed  except  for  3  of  the  18  bits  associated  with  multiplication,  3  of  the  8  sine 
bits  and  the  prime  number  acceptor  on  6  variables.  Note  that  when  the  same  function 
was  done  on  different  numbers  of  variables  (such  as  multiplication,  primality  test  and 
the  Fibonacci  number  acceptor),  the  percentage  cost  went  down  as  the  number  of 
variables  went  up.  The  binomial  coefficient  based  functions  were  exceptions.  Also 
note  that  the  number  of  bits  that  do  not  decompose  for  multiplication  was  increasing 
with  the  number  of  variables. 


6.3.2  Language  Acceptors 

We  were  encouraged  by  the  ability  of  DFC  to  reflect  the  patterns  in  a  wide  variety  of 
numerical  functions.  In  this  section,  we  generate  “languages”  to  see  if  the  DFC  of  a 
completely  different  kind  of  pattern  is  also  low. 

We did a series of experiments on a class of functions called “language acceptors.”
An abstract language is simply a subset of a set of strings. A language acceptor is a
function which outputs 1 if the input string is in the language and outputs 0 if the
input string is not in the language. There is a one-to-one correspondence between
total binary functions on strings and languages. An arbitrary language is not different
from a random function; therefore, we would not expect an arbitrary language to
correspond to a function that decomposes. We are interested in languages that can
be generated with a formally defined grammar. A grammar is a set of rules. Each rule
has a left part and a right part; both parts are strings. An element in a language is
generated by starting with a special starting symbol and then replacing symbols that
are the left part of some rule in the grammar with the right part of that same rule. If
there exists some sequence of rule applications that will yield a given string from the
starting symbol then that string is in the language. See [22] for a concise introduction
to formal languages. A context-free grammar is one in which the left parts of the
rules consist of a single symbol. In [22, pp. 178-179] there are algorithms for accepting
languages defined by context-free grammars in cubic time. If a language is defined by
a context-free grammar then we would expect the language acceptor for that language
to be patterned. Therefore, whether or not language acceptors decompose is a test
of the generality of the DFC measure of pattern-ness.

To conduct a test of the DFC measure relative to language acceptance we wrote
two programs. One randomly generates a set of context-free syntactic rules,
i.e. a context-free grammar. The grammars used in this experiment were generated
and then edited. The minor editing was necessary to remove duplicate rules and
generally ensure that a non-trivial language resulted. The second program takes a
grammar as input and generates the corresponding language accepting function in
AFD input format. Although languages generally include strings of any length, we
defined the language acceptor function only for input strings of a fixed length (in
particular, lengths of 9 bits). This was necessary because the AFD program is
designed for functions defined on vectors (which are the same as fixed length strings).
See Appendix A for more on the relationship between functions on strings and
functions on vectors. A sample of the languages resulting from these software tools
is shown in Table 6.19.
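
The second program is not listed in the report. As an illustration only, the sketch below builds a fixed-length language acceptor table using the standard CYK algorithm, one of the cubic-time methods for context-free languages. The grammar (odd-length palindromes over a and b, in Chomsky normal form) and all names are hypothetical, and nothing here reflects the AFD input format.

```python
from itertools import product

# Hypothetical grammar in Chomsky normal form: S -> A X | B Y | a | b,
# X -> S A, Y -> S B, A -> a, B -> b (odd-length palindromes).
UNARY = [("A", "a"), ("B", "b"), ("S", "a"), ("S", "b")]
BINARY = [("S", ("A", "X")), ("S", ("B", "Y")), ("X", ("S", "A")), ("Y", ("S", "B"))]


def cyk_accepts(s, unary=UNARY, binary=BINARY, start="S"):
    """CYK membership test: True if the grammar derives the string s."""
    n = len(s)
    if n == 0:
        return False
    # table[i][l - 1] holds the non-terminals that derive s[i:i + l].
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, ch in enumerate(s):
        table[i][0] = {lhs for lhs, terminal in unary if terminal == ch}
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            for split in range(1, length):
                left = table[i][split - 1]
                right = table[i + split][length - split - 1]
                for lhs, (x, y) in binary:
                    if x in left and y in right:
                        table[i][length - 1].add(lhs)
    return start in table[0][n - 1]


def acceptor_table(n):
    """Acceptor output for every string of length n, in lexicographic order."""
    return [int(cyk_accepts("".join(w))) for w in product("ab", repeat=n)]


print(sum(acceptor_table(3)), sum(acceptor_table(5)))
```

Calling acceptor_table(9) gives a 512-entry table of the same size as the functions decomposed in this experiment.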

The languages were then decomposed with version 2a of the AFD program. The
results of this experiment are in Table 6.20. In summary, 14 context-free language
acceptors were generated and decomposed. The highest cost was about 25 percent
and the average cost was less than 10 percent. This result supports the contention
that DFC measures the pattern-ness of syntactically patterned functions.

6.3.3  String  Manipulation  Functions 

In this section we generate functions with yet another class of patterns. These
functions are most easily thought of as functions on binary strings. The “palindrome”
function outputs 1 if the input binary string is symmetric about its center and
outputs 0 otherwise. The “majority gate” function outputs 1 if the binary input contains
more ones than zeros and outputs 0 otherwise. The “counting four ones” function
outputs a 1 if and only if the input binary string contains exactly four ones. The
“parity” function outputs 1 if and only if the input binary string has an odd number
of ones. The “XOR” is the exclusive OR function where the output is 1 unless the

Language 13   Language 4    Language 2    Language 11
aaaaaaaab     babbbbbba     ababababa     [illegible]
aaaaaabab     babbbbbbb     abababbab     aaaaabbbb
aaaaabaab     bbabbbbba     abababbba     aaaabbbba
aaaabaaab     bbabbbbbb     ababbabab     aabaaabbb
aaaababab     bbbabbbba     ababbabba     aababaabb
aaabaaaab     bbbabbbbb     ababbbaba     aabababab
aaabaabab     bbbbabbba     ababbbbab     aabababba
aaababaab     bbbbabbbb     ababbbbba     aababbaaa
aabaaaaab     bbbbbabba     abbababab
aabaaabab     bbbbbabbb     abbababba     abaaaaaaa
aabaabaab     bbbbbbaba     abbabbaba     abaaabbba
aababaaab     bbbbbbabb     abbabbbab     ababaabba
aabababab     bbbbbbbaa     abbabbbba     ababababa
baaaaaaab     bbbbbbbab     abbbababa     abababbaa
baaaaabab     bbbbbbbba     abbbabbab     ababbaaaa
baaaabaab     bbbbbbbbb     abbbabbba     abbaaaaaa
baaabaaab                   abbbbabab     baaaaaaaa
baaababab                   abbbbabba
baabaaaab                   abbbbbaba
baabaabab                   abbbbbbab
baababaab                   abbbbbbba
babaaaaab
babaaabab
babaabaab
bababaaab
babababab
bbbbbbbba

Table 6.19: Sample Languages.


Language   n   2^n   DFC    % DFC   No. of Productions   Non-terminals
1          9   512   28     5.5     5                    1
2          9   512   48     9.4     6                    2
3          9   512   32     6.3     7                    2
4          9   512   52     10.2    8                    2
5          9   512   0      0       8                    2
6          9   512   4      0.8     9                    2
7          9   512   32     6.3     6                    3
8          9   512   20     3.9     10                   3
9          9   512   32     6.3     10                   3
10         9   512   116    22.7    11                   3
11         9   512   124    24.2    12                   3
12         9   512   28     5.5     11                   4
13         9   512   64     12.5    15                   5
14         9   512   72     14.1    16                   5
Average              46.6   9.1

Table 6.20: DFC of Language Acceptors.


Function             n   2^n   DFC   % DFC
Palindrome           8   256   28    10.9
Majority Gate        7   128   48    37.5
Majority Gate        9   512   96    18.8
Counting Four Ones   7   128   64    50.0
Parity               7   128   24    18.8
Parity               8   256   28    10.9
Parity               9   512   32    6.3
XOR                  7   128   24    18.8

Table 6.21: Miscellaneous String Manipulation Functions.


Output Bit   n   2^n    DFC   % DFC
1            8   256    28    10.9
2            8   256    60    23.4
3            8   256    68    26.6
4            8   256    68    26.6
5            8   256    68    26.6
6            8   256    68    26.6
7            8   256    60    23.4
8            8   256    28    10.9
Total            2048   448   21.9

Table 6.22: Sorting Eight 1-Bit Numbers.


Output Bit   n   2^n    DFC   % DFC
1            8   256    12    4.7
2            8   256    56    21.9
3            8   256    16    6.3
4            8   256    160   62.5
5            8   256    16    6.3
6            8   256    184   71.9
7            8   256    12    4.7
8            8   256    56    21.9
Total            2048   512   25.0

Table 6.23: DFC of Sorting Four 2-Bit Numbers.

input  is  all  zeros  or  all  ones.  The  results  of  the  decomposition  of  these  functions  are 
shown  in  Table  6.21. 
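As a concrete reading of these definitions, the functions can be tabulated as below. This is a sketch, not the code used in the experiment: the helper names are ours, and `xor_report` follows the wording above (output 1 unless the input is all zeros or all ones) rather than the usual odd-parity meaning of XOR.

```python
def bits(x, n):
    """Return the n-bit binary string of x, most significant bit first."""
    return format(x, f"0{n}b")

def palindrome(s):
    return int(s == s[::-1])

def majority(s):
    return int(s.count("1") > s.count("0"))

def count_four_ones(s):
    return int(s.count("1") == 4)

def parity(s):
    return int(s.count("1") % 2 == 1)

def xor_report(s):
    # "XOR" as worded in the text: 1 unless the input is all zeros or all ones.
    return int(0 < s.count("1") < len(s))

def truth_table(f, n):
    """List f on all 2**n binary strings of length n, in counting order."""
    return [f(bits(x, n)) for x in range(2 ** n)]

for name, f, n in [("palindrome", palindrome, 8), ("majority", majority, 7),
                   ("count four ones", count_four_ones, 7),
                   ("parity", parity, 7), ("XOR", xor_report, 7)]:
    table = truth_table(f, n)
    print(f"{name}: n={n}, {sum(table)} ones out of {len(table)}")
```

Each truth table is the vector form of the string function, which is the form a decomposition program operates on.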

We did two “sorting” experiments. One considers sorting eight 1-bit numbers
and the other considers sorting four 2-bit numbers. Note that when sorting one-bit
numbers, $\mathrm{bit}_i(x) = 1 - \mathrm{bit}_{9-i}(255 - x)$. Also, for sorting four 2-bit numbers, the higher
order bits of the output (i.e. bits 1, 3, 5, 7) are independent of the low order bits of
the input (i.e. bits 2, 4, 6, 8). That is, inputs 2, 4, 6, and 8 are vacuous in bits 1, 3,
5, and 7.
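Both observations can be checked by brute force. The sketch below makes assumptions the text does not state: bit 1 is taken as the most significant bit, the sort is ascending, and in the 2-bit case input bits 2j-1 and 2j are the high and low bits of the j-th number. The function names are ours.

```python
def to_bits(x, n=8):
    """Bits 1..n of x as a list, bit 1 being the most significant."""
    return [(x >> (n - 1 - i)) & 1 for i in range(n)]

def sort_1bit(x):
    """Sort eight 1-bit numbers into ascending order; returns 8 output bits."""
    return sorted(to_bits(x))

def sort_2bit(x):
    """Sort four 2-bit numbers (input bits 2j-1, 2j form number j) ascending."""
    b = to_bits(x)
    nums = sorted(2 * b[2 * j] + b[2 * j + 1] for j in range(4))
    out = []
    for v in nums:
        out += [v >> 1, v & 1]  # high-order bit first
    return out

# Identity for 1-bit sorting: bit_i(x) = 1 - bit_{9-i}(255 - x), i = 1..8.
symmetry_holds = all(
    sort_1bit(x)[i - 1] == 1 - sort_1bit(255 - x)[(9 - i) - 1]
    for x in range(256) for i in range(1, 9)
)

def high_bits(x):
    """Output bits 1, 3, 5, 7 of the 2-bit sort."""
    return sort_2bit(x)[0::2]

# Vacuity: flipping a low-order input bit (2, 4, 6, 8) never changes
# a high-order output bit (1, 3, 5, 7).
vacuous = all(
    high_bits(x) == high_bits(x ^ (1 << (8 - k)))
    for x in range(256) for k in (2, 4, 6, 8)
)
print(symmetry_holds, vacuous)
```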

In summary, six different string-based functions were considered (palindromes,
majority gate, counting, parity, exclusive OR, and sorting). A total of 24 binary
functions derived from these were run on AFD version 2a. All 24 functions decomposed.
This supports the contention that DFC reflects the patterns in string-based
functions.


Bit   Arc
1     1,2
2     1,3
3     1,4
4     1,5
5     2,3
6     2,4
7     2,5
8     3,4
9     3,5
10    4,5

Table 6.24: Input Bits Represent Arcs

6.3.4  A  Graph  Theoretic  Function 

This experiment concerns the patterns in deciding whether or not a graph has a k-clique.
A graph has a k-clique if it has k nodes that are completely connected (i.e.,
each of the k nodes has an arc to all the other nodes in the clique). The k-clique
problem is NP-complete [54, p.4]. Therefore we would not expect this function to be
highly patterned.

The functions used in this experiment are of the form $f : \{0,1\}^{10} \to \{0,1\}$. The
input is the representation of an undirected graph with five nodes. Each bit in the
input indicates whether or not a given arc is in the graph as in Table 6.24. For
example, there is an arc between nodes 1 and 2 if bit 1 of the input is 1. Therefore,
the 10 input bits represent a graph and the output of the k-clique function is 1 if the
graph has a k-clique and 0 otherwise. We will consider each of the k-clique functions
for k from 1 through 5.

We cannot do the 1-clique problem with this setup since we do not allow for
arcs from a node to itself. However, this limitation does not affect the other k-clique
functions since such arcs do not change whether or not a graph has a k-clique (for
k > 1). That is, if we had defined $g : \{0,1\}^{15} \to \{0,1\}$ with bits 1-10 defined as
in Table 6.24 and bits 11-15 defined as in Table 6.25, then variables 11-15 would
be vacuous for k-clique functions with k > 1.

The 2-clique function has only one minority element (i.e. a graph has a 2-clique
unless it has no arcs), and therefore has cost 4(n - 1) = 36 for n = 10. Similarly, the
5-clique function has cost 36.
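The minority-element claims can be confirmed by enumerating all 1024 graphs. The sketch below is ours and assumes that bit 1 of Table 6.24 is the most significant bit of a 10-bit integer; the function names are illustrative only.

```python
from itertools import combinations

NODES = range(1, 6)
# Table 6.24: input bits 1..10 label the arcs (1,2), (1,3), ..., (4,5).
ARCS = list(combinations(NODES, 2))

def has_k_clique(x, k):
    """x is a 10-bit integer; bit 1 (assumed most significant) is arc (1,2)."""
    present = {arc for i, arc in enumerate(ARCS) if (x >> (9 - i)) & 1}
    return int(any(
        all(pair in present for pair in combinations(group, 2))
        for group in combinations(NODES, k)
    ))

def minority_count(k):
    """Number of inputs on which the k-clique function takes its rarer value."""
    ones = sum(has_k_clique(x, k) for x in range(2 ** 10))
    return min(ones, 2 ** 10 - ones)

n = 10
print("minority elements, k=2:", minority_count(2))
print("minority elements, k=5:", minority_count(5))
print("cost 4(n-1):", 4 * (n - 1))
```

For k = 2 the single minority element is the empty graph, and for k = 5 it is the complete graph, which is why the two functions share the same cost.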

We  attempted  to  decompose  the  3  and  4-clique  functions  with  the  AFD  program, 
but  did  not  find  a  decomposition  as  good  as  the  Savage  sum-of-products  form  (the 
AFD  runs  had  not  finished  after  3  days).  The  best  known  decompositions  then  are 
as  in  Table  6.26. 

The  complete  k-clique  function  (i.e.  the  input  includes  k  and  the  graph)  could  be 


Bit   Arc
11    1,1
12    2,2
13    3,3
14    4,4
15    5,5

Table 6.25: Additional Input Bits for Arcs to Self


k       DFC   Percentage DFC
1       0     0.0
2       36    3.5
3       116   11.3
4       116   11.3
5       36    3.5
Total   304   5.9

Table 6.26: DFC of the Various k-clique Functions on a Graph With 5 Nodes

computed from 1-clique, 2-clique, ... as follows. Say k is input as a 3-bit number; have
five circuits, one that is true only if k = 1, one that is true only if k = 2, etc. (called
a binary-to-positional transformation). The i-th of these circuits is then AND'ed with
the i-clique function. The results of all the AND's are then OR'ed together. The result
is the k-clique function. The realization of k-clique just described has the cost of the
individual i-clique functions (304), plus the cost of the five counters (5 x 8 = 40), plus
the cost of five AND's (5 x 4 = 20) and four OR's (4 x 4 = 16). Therefore, the total cost
of the k-clique function of the form $f : \{0,1\}^3 \times \{0,1\}^{10} \to \{0,1\}$ is 380. The size of
the function is $2^{13}$, so as a percentage the DFC is 4.6 percent.
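The construction just described can be sketched in code. This is an illustration under our own assumptions: the graph is a tuple of ten 0/1 values in Table 6.24 order, k is a 3-bit tuple with the most significant bit first, the 1-clique function is taken to be constantly 1, and values of k outside 1 to 5 give output 0. The cost figures are copied from the text, not derived by the code.

```python
from itertools import combinations

ARCS = list(combinations(range(1, 6), 2))  # 10 arcs of a 5-node graph

def i_clique(graph, i):
    """1 if the graph (tuple of ten 0/1 values in ARCS order) has an i-clique."""
    present = {arc for arc, bit in zip(ARCS, graph) if bit}
    return int(any(
        all(p in present for p in combinations(group, 2))
        for group in combinations(range(1, 6), i)
    ))

def k_clique_composed(k_bits, graph):
    """OR over i of (k == i) AND i_clique(graph), with k given as 3 bits."""
    k = 4 * k_bits[0] + 2 * k_bits[1] + k_bits[2]
    result = 0
    for i in range(1, 6):
        selector = int(k == i)  # output i of the binary-to-positional stage
        result |= selector & i_clique(graph, i)
    return result

# Cost bookkeeping, using the figures quoted in the text.
cost = 304 + 5 * 8 + 5 * 4 + 4 * 4
percent = 100 * cost / 2 ** 13
print(cost, round(percent, 1))
```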

From [54, p.5], we see that the k-clique function on a graph with four nodes has the
form $f : \{0,1\}^2 \times \{0,1\}^6 \to \{0,1\}$. The 3-clique function on a graph with four nodes
has DFC = 44. As in the k-clique function on 5-node graphs, the k-clique function
on 4-node graphs could be computed with the above functions, four counters (cost
= 4 x 4 = 16), four AND's (cost = 4 x 4 = 16), and three OR's (cost = 3 x 4 = 12).
Therefore, the total cost of the k-clique function on a 4-node graph is 128. The size
of the function is $2^8$, so the percentage DFC is 50 percent.

In  summary,  the  DFC  of  the  k-clique  function  on  4-node  graphs  (input  size  8-bits) 
is  about  128  and  on  5-node  graphs  (input  size  13-bits)  is  about  380.  Despite  the  fact 
that  the  complexity  of  the  k-clique  function  grows  rapidly  (i.e.  NP)  as  the  size  of  the 
input  increases,  the  complexity  is  low  for  a  5-node  graph.  It  would  be  interesting  to 
evaluate  the  DFC  of  several  NP-Complete  problems  for  several  input  sizes  each.  We 


k       DFC   Percentage DFC
1       0     0.0
2       20    31.3
3       44    68.8
4       20    31.3
Total   84    32.8

Table 6.27: DFC of the Various k-clique Functions on a Graph With Four Nodes


Font   Name         Number of Characters
0      default      253
1      triplex      94
2      small        94
3      sans serif   95
4      Gothic       95

Table 6.28: Turbo Pascal V5.5 Font Sets

only  have  two  data  points  on  this  single  NP-Complete  problem,  but  the  growth  in 
complexity  is  much  less  than  we  expected.  There  are  many  long-standing  unanswered 
questions  about  the  complexity  of  NP  problems  and  decomposition  may  be  an  avenue 
to  some  new  insights.  In  any  case,  DFC  reflects  the  patterns  in  yet  another  context. 

6.3.5  Images  as  Functions 

In  the  preceding  sections  we  saw  that  DFC  captures  the  essential  complexity  of  many 
kinds  of  functions.  In  this  section,  we  consider  the  pattern-ness  of  images.  Since  DFC 
measures  the  pattern-ness  of  binary  functions,  we  need  to  represent  an  image  as  a 
binary  function.  Suppose  we  want  to  assess  the  pattern-ness  of  a  16  pixel  by  16  pixel 
black and white image. We represent this image as a binary function of the form

$f : \{0,1\}^n \times \{0,1\}^m \to \{0,1\}$

where n = 4 and m = 4. The first 4 bits of the input specify a column of the image,
the  last  4  bits  specify  a  row  and  the  output  is  the  color  (0  for  white  and  1  for  black) 
of  the  pixel  at  that  row  and  column. 
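A minimal sketch of this representation follows. It assumes the image is held as a list of 16 rows of 0/1 values and that the 8 input bits are read as one integer with the column bits as the high-order half; the function names are ours.

```python
def image_to_function(image):
    """Map a 16x16 image (list of 16 rows of 0/1) to a 256-entry truth table.

    Input bits 1-4 select the column and bits 5-8 select the row, so the
    table index is 16 * column + row.
    """
    return [image[idx % 16][idx // 16] for idx in range(256)]

def function_to_image(table):
    """Inverse mapping: rebuild the 16 rows from the 256-entry truth table."""
    return [[table[16 * col + row] for col in range(16)] for row in range(16)]

# Example image: a single black pixel at row 2, column 5.
image = [[0] * 16 for _ in range(16)]
image[2][5] = 1
table = image_to_function(image)
print(table.index(1))  # index 16 * 5 + 2
```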

This experiment uses a 16 by 16 pixel sampling of the characters generated by
Turbo Pascal V5.5^ for images. There are 631 total characters in 5 fonts, distributed
as in Table 6.28. Each character was drawn on a VGA monitor with the maximum
font size that would allow all characters of a given font to fit within a 16 x 16 pixel

^Borland  International. 


                       Font 0   Font 1   Font 2   Font 3   Font 4   All Fonts
Number of Characters   253      94       94       95       95       631
Average DFC            44.0     111.3    78.6     96.0     141.9    81.7
Maximum DFC            64       224      160      256      256      256
Minimum DFC            0        24       20       12       24       0
Number with DFC=256    0        0        0        1        11       12

Table 6.29: Character Images DFC Statistics

square.  The  image  was  then  read  and  converted  into  the  Ada  Function  Decomposition 
(AFD)  input  format.  The  results  are  listed  in  Table  6.29. 

For comparison purposes, the images are listed in Figures 6.10 through 6.14 with
their character number and their DFC. The images start with character number 0 in
the upper left hand corner and then go across 16 to a row. The rows and columns
are numbered such that the number of a character can be found by summing its
row and column numbers. The number just to the right of each character is the
character's DFC. Characters 1-30 of font 3 and characters 21-30 of font 4 are the
same as character 31 in each of the fonts. Although these characters are printed in
Figures 6.10 through 6.14, they were not run and are not included in the statistics
of Table 6.29. There is generally a one-to-one correspondence between dots in the
images and ones in the output of the functions that were decomposed. However, there
are cases where some differences exist.

In  summary,  the  average  DFC  for  all  the  characters  is  81.7  as  compared  to  256 
for  a  random  function.  Only  12  of  the  631  characters  did  not  decompose  (less  than  2 
percent).  Font  0  was  the  most  patterned  and  font  4  the  least.  These  results  support 
the  contention  that  DFC  measures  the  pattern-ness  of  images. 

6.3.6  Data  as  Functions 

Data  compression  depends  upon  finding  some  pattern  in  the  data,  for  example,  that 
there  are  long  strings  of  blanks  or  that  characters  are  repeated  many  times.  A  central 
thesis  of  PT  1  has  been  that  function  decomposition  is  a  way  to  recognize  almost 
any  kind  of  pattern.  As  another  test  of  this  thesis  we  used  the  AFD  program  to 
decompose  some  files.  The  results  of  the  decomposition  are  then  compared  to  the 
compression  achieved  by  two  popular  programs  for  the  PC,  PKZIP  and  PKARC^. 

AFD  is  limited  to  running  on  functions  of  about  9  variables  or  512  points.  This 
limits  direct  application  of  AFD  to  files  of  64  bytes.  We  ran  AFD  version  4a  on  five 
files: 

•  fgfile.pas  —  the  first  64  characters  of  a  pascal  program. 

^PKWare  Inc.  Glendale,  WI.  1987 


Figure 6.10: Font 0 Images and DFC

Figure 6.11: Font 1 Images and DFC

Figure 6.12: Font 2 Images and DFC

Figure 6.13: Font 3 Images and DFC

Figure 6.14: Font 4 Images and DFC


•  story.in'  —  a 64 character sentence.

•  command.com  —  the  first  64  characters  of  an  executable  file. 

•  xor7.cht  —  the  first  64  characters  of  a  graphics  file. 

•  parityS.fng  —  the  first  64  characters  of  a  numeric  data  file. 

Only  parityS.fng  decomposed,  to  a  cost  of  220.  The  size  of  all  these  files  increased, 
typically  to  around  90  bytes  when  PKARC’ed.  Apparently,  there  is  no  interesting 
compression  activity  for  files  that  are  this  small. 

In  order  to  allow  us  to  run  larger  files,  we  considered  each  bit  of  a  byte  to  be  a 
different  function.  Now  we  can  look  at  512  byte  files  as  8  functions  each  on  9  variables. 
We  ran  several  files  like  this. 
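The per-bit splitting can be sketched as follows. The assumptions here are ours: bit 1 is taken to be the most significant bit of each byte, the byte position within the file is the 9-bit input, and only the first 512 bytes are used. The function names are illustrative and the output is a list of truth tables, not the AFD input format.

```python
def file_to_bit_functions(data, n=9):
    """Split the first 2**n bytes into 8 truth tables, one per bit of a byte.

    Table b (1..8, bit 1 taken as the most significant bit) lists that bit
    for byte positions 0 .. 2**n - 1; the position is the n-bit input.
    """
    block = data[: 2 ** n]
    if len(block) < 2 ** n:
        raise ValueError("need at least 2**n bytes")
    return [[(byte >> (8 - b)) & 1 for byte in block] for b in range(1, 9)]

def is_constant(table):
    """True if the function takes the same value on every input."""
    return len(set(table)) == 1

# Example: a file of alternating ASCII '0' and '1' characters.
planes = file_to_bit_functions(b"01" * 256)
print([is_constant(p) for p in planes])
```

In this example only the last bit plane varies, and it simply copies the lowest-order bit of the byte position.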

•  cin.asc,  537  bytes,  a  text  file  containing  the  first  512  characters  of  Chapter  1. 

•  hawaii.asc,  524  bytes,  a  text  file  containing  the  first  paragraph  of  an  example 
data  text  file  from  enableOA**. 

•  data.asc,  519  bytes,  a  section  of  the  numerical  data  from  the  FERD  experiment. 

•  dbf.asc,  520  bytes,  a  section  of  the  example  data  base  file  from  enableOA. 

•  west.muz,  519  bytes,  a  section  of  a  file  of  music  data  for  Pianoman®.

•  auld.muz,  519  bytes,  a  section  of  a  file  of  music  data  for  Pianoman. 

•  mousea.ss,  516  bytes,  a  section  from  the  first  part  of  VGA  image  of  a  mouse. 

•  mouseb.ss,  516  bytes,  a  section  from  the  middle  of  VGA  image  of  a  mouse. 

•  mousec.ss,  516  bytes,  a  section  from  the  last  part  of  VGA  image  of  a  mouse. 

The  results  for  each  run  are  listed  in  Table  6.30  where  the  rows  identified  as  1,  2, 
etc.  are  the  DFC  for  output  bits  1,  2,  etc.  of  the  file  labeling  the  appropriate  column. 
The  row  labeled  “Total”  is  the  sum  of  the  DFC’s  for  all  eight  bits.  The  rows  labeled 
PKARC  and  PKZIP  are  the  compressed  file  size  in  bits.  These  results  are  summarized 
in  Table  6.31,  where  L  is  the  length  of  an  encoding  of  the  decomposition. 

We  also  generated  some  functions  with  trivial  decompositions  to  see  what  PKARC 
and  PKZIP  would  do  with  them. 

•  (file name: rand01) A random string of ASCII 0's and 1's. The file had 514
bytes. Bits 1-7 had zero DFC and bit 8 did not decompose. Therefore, DFC =
512 or 12.5 percent, L = 646 or 15.8 percent, PKARC = 173 bytes or 33.7 percent,
PKZIP = 244 bytes or 47.4 percent.


**Trademark of Enable Software Inc.
®Neil J. Rubenking, 1988.


Bit     cin    hawaii   data   dbf    west   auld   mousea   mousec   mouseb
1       0      0        0      0      36     92     236      512      512
2       512    512      0      512    42     164    512      512      512
3       148    180      0      228    92     104    368      512      512
4       512    512      512    512    76     116    512      320      512
5       512    512      316    512    0      0      348      512      512
6       512    512      276    120    68     124    512      512      512
7       512    512      416    288    40     80     512      512      512
8       512    512      512    320    96     132    512      512      512
Total   3220   3252     2032   2492   450    812    3512     3904     4096
PKARC   3232   2336     1688   2336   1672   1664
PKZIP   3872   2968     2344   2968   1688   1784

Table 6.30: DFC and Data Compression Results for Typical Files


File       ARC%   DFC%    L%      ZIP%
auld       40.1   19.8    88.1    43.0
west       40.3   11.0    88.1    40.7
data       40.7   49.6    64.0    56.5
dbf        51.2   60.8    88.1    71.3
mousea     64.5   85.7    100.0   80.6
mouseb     70.7   100.0   100.0   87.0
mousec     74.0   95.3    100.0   90.3
cin        75.2   78.6    87.7    91.1
hawaii     80.7   79.4    88.1    96.8
Average:   56.4   64.5    82.0    70.2

Table 6.31: Data Compression Summary for Typical Files


File                 ARC%   DFC%   L%     ZIP%
zeros                8.0    0.0    3.7    30.5
spaces               8.7    6.8    27.8   47.3
zeroone              16.7   0.0    3.7    33.7
majgate              26.6   2.3    15.8   45.1
rand01               33.7   12.5   15.8   47.4
asciistr             94.4   0.0    3.7    85.5
Average:             31.4   3.6    11.8   48.3
All files Average:   48.4   40.1   58.3   63.1

Table 6.32: Data Compression Summary for Atypical Files

•  (file name: majgate) A string of ASCII 0's and 1's defining the majority gate
function. The file had 514 bytes. Bits 1-7 had zero DFC and bit 8 decomposed
to 96. Therefore, DFC = 96 or 2.3 percent, L = 646 or 15.8 percent, PKARC
= 136 bytes or 26.6 percent, PKZIP = 232 bytes or 45.1 percent.

•  (file name: zeros) A string of ASCII 0's. The file had 514 bytes. All 8 bits had
zero DFC. Therefore, DFC = 0 or 0 percent, L = 152 or 3.7 percent, PKARC
= 41 bytes or 8.0 percent, PKZIP = 157 bytes or 30.5 percent.

•  (file name: zeroone) A string of alternating ASCII 0's and 1's. The file had
516 bytes. All 8 bits had zero DFC. Therefore, DFC = 0 or 0 percent, L =
152 or 3.7 percent, PKARC = 86 bytes or 16.7 percent, PKZIP = 174 bytes or
33.7 percent.

•  (file name: spaces) A file with two lines, each line has two ASCII 1's separated
by 254 spaces. The file had 516 bytes. Six output bits had zero DFC, the
other two had 4 minority elements so their cost is no more than 140. Therefore,
DFC = 280 or 6.8 percent, L = 1140 or 27.8 percent, PKARC = 45 bytes or
8.7 percent, PKZIP = 244 bytes or 47.3 percent.

•  (file name: asciistr) A file with two lines, each line lists all ASCII characters in
order. The file had 517 bytes. All output bits had zero DFC. Therefore, DFC
= 0 or 0 percent, L = 152 or 3.7 percent, PKARC = 488 bytes or 94.4 percent,
PKZIP = 442 bytes or 85.5 percent.

These  results  are  summarized  in  Table  6.32. 

The ARC, ZIP and AFD data are not directly comparable for several reasons.
The AFD representation is “random access” while the ARC and ZIP files must be
decompressed as a complete file. The ARC and ZIP compression routines run many
times faster than AFD. DFC does not measure a complete representation (i.e. does
not measure the interconnection complexity) and AFD does not optimize L. Also,
for n = 9, as is the case for all these runs, L is bigger than $2^n$ unless the function has



vacuous variables. That is, L includes an $n^2[A]$ term and, for no vacuous variables, $[A] \ge n$.
Therefore, $L \ge n^3 > 2^n$ for $n = 9$. Therefore, we cannot expect decomposition
to  be  a  realistic  data  compression  approach  for  small  n.  However,  despite  that,  the 
data  compression  performance  for  AFD  (as  measured  by  L)  is  in  the  same  ball-park 
as  ARC  and  ZIP.  That  is,  ARC  had  an  average  compression  factor  of  0.48  on  all 
files,  ZIP  0.63  and  AFD  0.58.  It  is  also  interesting  that  there  was  a  high  degree  of 
correlation  between  pattern-ness  as  measured  by  DFC  and  the  degree  of  compression 
achieved  by  ARC  and  ZIP.  For  the  typical  files,  the  correlation  coefficient  between 
DFC  and  ARC  was  0.87  and  between  DFC  and  ZIP  it  was  0.94.  Remember  that 
the  ARC  and  ZIP  compression  programs  were  written  by  someone  who  had  studied 
the  common  kinds  of  files  and  had  recognized  some  patterns  in  these  files  that  allow 
them  to  be  compressed.  ARC  and  ZIP  therefore  look  for  specific  kinds  of  patterns 
and  those  kinds  of  patterns  were  originally  discovered  by  a  person.  AFD,  on  the  other 
hand,  finds  the  patterns  itself.  The  patterns  it  found  in  the  files  allowed  compression 
that  was  comparable  to  that  of  the  hand-crafted  methods  of  ARC  and  ZIP.  That 
there  is  little  pattern-ness  beyond  that  considered  by  ARC  and  ZIP  is  evident  in  the 
high  degree  of  correlation  between  DFC  and  the  ARC  and  ZIP  compression  factors 
for  typical  files.  That  there  exist  some  kinds  of  patterns  that  ARC  and  ZIP  do  not 
look for is evident in the file asciistr where the DFC is 0 yet ARC and ZIP had
compression  factors  between  0.85  and  0.95. 
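The correlation coefficients quoted for the typical files can be rechecked from the ARC%, DFC% and ZIP% columns of Table 6.31. The sketch below assumes the ordinary Pearson coefficient was meant, which the text does not state; the helper `pearson` and the dictionary name are ours.

```python
from math import sqrt

# Rows of Table 6.31 (typical files): name -> (ARC%, DFC%, ZIP%).
TABLE_6_31 = {
    "auld": (40.1, 19.8, 43.0),
    "west": (40.3, 11.0, 40.7),
    "data": (40.7, 49.6, 56.5),
    "dbf": (51.2, 60.8, 71.3),
    "mousea": (64.5, 85.7, 80.6),
    "mouseb": (70.7, 100.0, 87.0),
    "mousec": (74.0, 95.3, 90.3),
    "cin": (75.2, 78.6, 91.1),
    "hawaii": (80.7, 79.4, 96.8),
}

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

arc, dfc, zipped = (list(col) for col in zip(*TABLE_6_31.values()))
r_arc = pearson(dfc, arc)
r_zip = pearson(dfc, zipped)
print(f"DFC vs ARC: {r_arc:.2f}, DFC vs ZIP: {r_zip:.2f}")
```

Under that assumption the two values should come out close to the coefficients quoted above.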

In summary, files were treated as functions and run on AFD. Not only does AFD
find patterns in yet another kind of function, it does so as well as programs hand-crafted
for this class of patterns. The generality of AFD (and the lack of generality in
the hand-crafted programs) is indicated by an example where the compression factor
for AFD is less than 5 percent of that of the hand-crafted pattern finders.

6.3.7  Summary 

It  is  our  contention  that  DFC  measures  the  essential  pattern-ness  that  is  important  in 
computing. We have the AFD algorithm that estimates DFC (never underestimating it,
though). We took this algorithm and estimated the DFC of as many kinds of non-random
functions as we could imagine. This involved about 850 decompositions.
Table  6.33  is  a  summary  of  all  the  decompositions  of  the  non-random  functions.  The 
DFC  column  is  an  average  when  multiple  runs  are  involved.  Table  6.34  shows  that 
larger  n  tends  to  have  greater  decomposability.  With  the  current  AFD  algorithm, 
we  are  just  able  to  decompose  functions  with  sufficiently  large  n  to  have  interesting 
patterns;  which  makes  these  results  all  the  more  remarkable. 

Recall  that  random  functions  do  not  decompose  with  high  probability.  In  all  850 
runs  there  were  only  about  20  functions  that  did  not  decompose.  As  you  look  over 
Table  6.33  notice  the  correlation  between  DFC  and  the  intuitive  complexity  of  each 
function.  We  believe  that  this  is  the  most  general  quantitative  measure  of  pattern-ness 
ever  proposed. 


Function          n    m   m x 2^n   DFC      % DFC   Number of runs
XOR               7    1   128       24       18.8    1
count four ones   7    1   128       64       50.0    1
binomial coeff.   8    1   256       20       7.8     1
Lucas function    8    1   256       28       10.9    1
greater than      8    1   256       28       10.9    1
palindrome        8    1   256       28       10.9    1
images of chars   8    1   256       81.7     31.9    631
logarithm         8    3   768       56       7.3     3
square root       8    4   1024      132      12.9    4
GCD               8    4   1024      388      37.9    4
modulus           8    4   1024      280      27.3    4
remainder         8    4   1024      378      36.9    4
addition          8    5   1280      92       7.2     5
subtraction       8    5   1280      218      17.0    5
1 bit sorting     8    8   2048      448      21.9    8
2 bit sorting     8    8   2048      512      25.0    8
multiplication    8    8   2048      892      43.6    8
sine              8    8   2048      1350     65.9    8
parity            9    1   512       32       6.3     1
language accept   9    1   512       46.6     9.1     14
majority gate     9    1   512       96       18.8    1
Fibonacci test    9    1   512       144      28.1    1
cube root         9    3   1536      168      10.9    3
determinant       9    3   1536      270      17.6    3
files             9    8   4096      1644.1   40.1    120
primality test    10   1   1024      600      58.6    1
k-clique          13   1   8192      380      4.6     5

Table 6.33: Decomposition Summary for Non-Randomly Generated Functions


Function          n    m   DFC   % DFC   Number of runs
multiplication    4    4   40    62.5    4
multiplication    6    6   164   42.7    6
multiplication    8    8   892   43.6    8
primality test    6    1   64    100.0   1
primality test    7    1   104   81.2    1
primality test    8    1   196   76.6    1
primality test    9    1   336   65.6    1
primality test    10   1   600   58.6    1
Fibonacci test    5    1   24    75.0    1
Fibonacci test    6    1   48    75.0    1
Fibonacci test    7    1   76    59.4    1
Fibonacci test    8    1   108   42.2    1
Fibonacci test    9    1   144   28.1    1
binomial coeff.   6    1   12    4.7     1
binomial coeff.   8    1   20    7.8     1
Lucas function    6    1   20    31.2    1
Lucas function    8    1   28    10.9    1
majority gate     7    1   48    37.5    1
majority gate     9    1   96    18.8    1
parity            7    1   24    18.8    1
parity            8    1   28    10.9    1
parity            9    1   32    6.3     1
k-clique          8    1   128   50.0    4
k-clique          13   1   380   4.6     5


6.4  Patterns  as  Perceived  by  People 

In this section we consider the relationship between complexity as measured by a function
decomposition algorithm and complexity as perceived by people when functions
are represented as two-dimensional images.

We  have  hypothesized  that  the  DFC  measure  captures  the  essence  of  pattern-ness 
in  general.  DFC  has  established  correlation  with  conventional  measures  such  as  time 
complexity,  program  length  and  circuit  complexity.  Since  people  are,  in  some  sense, 
a  computation  system,  we  wonder:  does  this  measure  correlate  with  complexity  as 
perceived  by  people?  The  AFD  program  provides  the  DFC  measure  of  complexity.  To 
get  human  assessments  of  the  complexity  of  these  functions,  we  turned  the  function 
into  something  which  can  be  easily  perceived.  There  are  a  variety  of  possible  ways  of 
doing  this.  An  image  can  be  created  from  a  function  as  discussed  in  Section  6.3.5.  A 
similar  method  could  be  used  to  produce  other  experiences  (e.g.  sounds). 

We  generated  a  set  of  test  functions,  found  the  DFC  complexity  using  the  AFD 
program,  found  the  complexity  as  perceived  by  people  and  assessed  the  relationship 
between  these  two  measures.  Thomas  Abraham  reports  the  results  of  this  study  in 
[1].  In  summary,  this  experiment  found  a  correlation  coefficient  of  0.8  between  DFC 
and  pattern-ness  as  ranked  by  people. 

6.4.1  Effect  of  the  Order  of  Variables  on  the  Pattern-ness 
of  Images 

At  various  points  throughout  the  PT  1  study  it  has  been  useful  to  think  of  binary 
functions  as  black  and  white  images.  This  technique  was  used  in  the  extrapolation 
experiments,  in  the  tests  for  DFC  generality  and  in  the  pattern-ness/DFC  correlation 
study  just  discussed. 

In  order  to  get  an  image  from  a  binary  function  one  must  choose  which  variables 
specify  rows  and  which  specify  columns.  You  must  also  decide  the  order  among  the 
row  variables  and  among  the  column  variables.  This  experiment  is  concerned  with  the 
effect  of  the  chosen  grouping  of  variables  on  the  appearance  of  the  images.  Table  6.35 
lists  the  images  used  in  this  experiment. 

We used the Turbo Pascal character set (as in Section 6.3.5) as a source of functions.
A program was used to draw a specified function with a variety of variable
permutations. Figures 6.15 through 6.18 show a sequence of images resulting from
the program. Each figure has the character number, the font number, and the
Decomposed Function Cardinality (DFC) listed at the bottom. At the top left of the
figure is the image as drawn by the Turbo Pascal procedure outtextxy. At top right
is the drawing of the function that was generated by “getting pixels” from the left
image. The top two images should always be the same. The order of the variables
for the top row’s images can arbitrarily be labeled as 1, 2, 3, 4 for the columns and 5,
6, 7, 8 for the rows. From left to right, the second row of images shows permutations
of these variables as listed in Table 6.36. Permutation 1 is a swapping of the high and low


Font No.   Char No.   DFC
   0          177        4
   0          197       20
   0           15       36
   0            1       40
   0           10       44
   0           65       64
   2           48       80
   2           51      104
   2           65      120
   3           65      160
   3           31      256
   1           65      184
   4           65      256

Table 6.35: Character Images


Figure 6.15: Variable Permutations for Characters 177 and 197 of Font 0


Figure 6.16: Variable Permutations for Characters 15 and 1 of Font 0


Figure 6.17: Variable Permutations for Characters 10 of Font 0 and 48 of Font 2


Figure 6.18: Variable Permutations for Characters 51 of Font 2 and 31 of Font 3


Permutation   Column Variables   Row Variables
Original        1  2  3  4         5  6  7  8
    1           4  3  2  1         5  6  7  8
    2           1  2  3  4         8  7  6  5
    3           4  3  2  1         8  7  6  5
    4           5  2  3  4         1  6  7  8
    5           1  6  3  4         5  2  7  8
    6           1  2  7  4         5  6  3  8
    7           1  2  3  8         5  6  7  4
    8           1  2  7  8         5  6  3  4
    9           1  6  3  8         5  2  7  4

Table 6.36: Permutations of Variables


order  bits  of  the  column  variables;  2  is  a  swapping  of  the  high  and  low  order  bits  of 
the  row  variables;  3  is  a  swapping  of  the  high  and  low  order  bits  of  both  the  column 
and  row  variables;  4  through  7  swap  a  column  variable  for  a  row  variable;  8  swaps  two 
side-by-side  column  variables  for  two  side-by-side  row  variables;  9  swaps  every  other 
column  variable  for  every  other  row  variable.  Note  that  all  the  font  0  characters  have 
at  least  two  vacuous  variables  while  none  of  the  characters  in  the  other  fonts  have 
any  vacuous  variables. 
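
The effect of a variable permutation on an image can be sketched in a few lines of Python. This is an illustration only, not the Turbo Pascal program used in the study. The function names (`render`, `count_islands`), the truth-table indexing convention (variable v carries weight 2^(v-1)) and the use of 4-connectivity to define an "island" are all assumptions made for the example, as is the toy function being drawn.

```python
def render(table, col_vars, row_vars):
    """Draw a binary function as a list of image rows.

    table[i] is the function value for the input in which variable v
    (numbered from 1) equals bit v-1 of i. This indexing is an assumed
    convention. The first variable listed is the high-order bit of the
    column (or row) coordinate.
    """
    n_c, n_r = len(col_vars), len(row_vars)
    image = []
    for r in range(2 ** n_r):
        row = []
        for c in range(2 ** n_c):
            index = 0
            for j, v in enumerate(col_vars):
                index |= ((c >> (n_c - 1 - j)) & 1) << (v - 1)
            for j, v in enumerate(row_vars):
                index |= ((r >> (n_r - 1 - j)) & 1) << (v - 1)
            row.append(table[index])
        image.append(row)
    return image


def count_islands(image):
    """Count 4-connected groups of 1-pixels (an assumed meaning of 'island')."""
    height, width = len(image), len(image[0])
    seen = set()
    islands = 0
    for r in range(height):
        for c in range(width):
            if image[r][c] != 1 or (r, c) in seen:
                continue
            islands += 1
            stack = [(r, c)]
            seen.add((r, c))
            while stack:
                y, x = stack.pop()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and image[ny][nx] == 1 and (ny, nx) not in seen):
                        seen.add((ny, nx))
                        stack.append((ny, nx))
    return islands


# Hypothetical 8-variable function: the output is 1 exactly when variable 1 is 0.
table = [1 - (i & 1) for i in range(256)]
original = render(table, [1, 2, 3, 4], [5, 6, 7, 8])  # original ordering
permuted = render(table, [4, 3, 2, 1], [5, 6, 7, 8])  # column order reversed
print(count_islands(original), count_islands(permuted))
```

With the original ordering this toy function is a single solid block (one island); with the column variables reversed, the same function is drawn as eight separate vertical stripes. The number of 1-pixels is identical in both images, which is the sense in which only the apparent pattern-ness changes.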

These data show that the apparent pattern-ness of a given function is highly
dependent upon the variable permutation. For example, character 51 of font 2 with
the original combination of variables appears to be highly patterned, while none of the
images resulting from random permutations of the variables appear to be
patterned. This points out the risk in trying to “see” the patterns in a function by
turning it into an image. There is at least one definite property, “connected-ness,”
that the original images have and that is lost in the random permutations. All but two
of the original images (i.e. all but characters 1 and 177 of font 0) are made up of no
more than 3 “islands.” The randomly permuted images are made up of many more
of these islands. We argue that the “real” abstract pattern-ness of all the images of
a particular function is the same and that the difference in apparent pattern-ness is
a result of our biological visual system.


6.5  Pattern-ness Relationships for Related Functions

When  we  defined  the  PT  1  problem  (Section  3.4.9)  we  chose  to  only  consider  functions 
defined  by  tables.  However,  we  eventually  will  consider  functions  defined  in  other 
ways.  This  raises  the  question,  what  does  the  pattern-ness  of  one  function  imply 
about  the  pattern-ness  of  a  related  function?  We  are  especially  interested  in  this 
question when one function is used to define a second function. For example, we
might be asked to compute y = f(x) where x is itself specified as a function of y. What
does knowing how to compute x from y tell us about computing y from x? We hope to study this question
in  depth  in  PT  2;  however,  we  have  some  initial  results  that  are  described  in  this 
section. 


6.5.1  Functions  and  Their  Complements 

Perhaps the most trivial kind of relationship is when one function is simply the
complement of a second function. If f(x) = g(not(x)) or not(g(x)) or not(g(not(x)))
then f and g have the same DFC (except when f or g is a projection function).
Not only is the DFC the same, so is the architecture of the decomposition (i.e. the
algorithm). Therefore, if we are asked to design an algorithm for f and f is defined
for us in the form of an algorithm for the complement of f then our job is very simple.




6.5.2  Functions  and  Their  Inverses 


This section, which summarizes [30], is concerned with the relationship between the
complexity of a function and the complexity of that function’s inverse.^6

One motivation for studying this problem comes from the fact that many computational
problems are originally specified in terms of the inverse of the function
which  we  want  to  realize.  For  example,  there  is  the  “classical  inverse  problem”  of 
computer  vision.  The  transformation  from  a  3-D  model  of  a  space  into  a  2-D  image 
of  that  space  from  a  particular  perspective  (Projective  Geometry)  is  well  understood 
and  of  relatively  low  computational  complexity.  The  computer  vision  problem  is  easy 
to  specify  using  Projective  Geometry.  However,  the  objective  in  computer  vision  is 
to  take  a  2-D  image  and  generate  from  that  a  3-D  model.  Therefore,  the  problem  is 
specified  using  a  function  (Projective  Geometry)  which  is  the  inverse  of  the  function 
that  we  wish  to  realize  on  a  computer.  A  second  example  is  the  situation  where  we 
want  to  find  an  x  with  a  particular  given  property  y.  Typically,  the  properties  of  x 
(say  P(x))  are  easy  to  compute  and  this  is  how  the  problem  is  specified.  However, 
the  inverse  of  P  is  needed  to  find  x  with  a  given  input  property  y. 

The question then arises: if a function (f) is of low computational complexity
(which is generally the case when f is used in a problem specification) then should
we expect the inverse of f to have low computational complexity? In other words, if
a problem has a simple specification then does it necessarily have a simple computer
realization?

Our approach to studying this problem was to generate many functions and their
inverses and then compute their complexity. Since we are studying only true functions,
we required the functions of this study to be bijections. That is, both the function
and its inverse are true functions.

This experiment was done with functions of the form

$$f : \{0,1\}^4 \rightarrow \{0,1\}^4.$$

There are actually four single-output functions going each way, and the “cost” in the
tables is the sum of the DFCs of the four functions. We decomposed about 34,000
function-inverse pairs. Table 6.37 shows the number of function-inverse pairs with each
combination of costs. Figure 6.19 shows the average DFC
relationship between functions and their inverses.
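
The setup of this experiment can be sketched as follows. This is only an illustration of how a bijection on four bits, its inverse, and the four single-output component functions going each way might be represented; the helper names and the fixed seed are assumptions, and the DFC itself is not computed here (that would require a decomposition program such as AFD).

```python
import random

N_BITS = 4  # functions of the form {0,1}^4 -> {0,1}^4


def random_bijection(rng):
    """Return a random permutation of 0..15; values[x] is the output for input x."""
    values = list(range(2 ** N_BITS))
    rng.shuffle(values)
    return values


def inverse(mapping):
    """Invert a bijection given as a list."""
    inv = [0] * len(mapping)
    for x, y in enumerate(mapping):
        inv[y] = x
    return inv


def component_functions(mapping):
    """Split a 4-bit-output mapping into four single-output truth tables."""
    return [[(y >> bit) & 1 for y in mapping] for bit in range(N_BITS)]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
f = random_bijection(rng)
f_inv = inverse(f)
f_parts = component_functions(f)          # four binary functions for f
f_inv_parts = component_functions(f_inv)  # four binary functions for the inverse
```

Each of the eight component truth tables would then be decomposed separately, and the four costs in each direction summed to give the cost of f and of its inverse.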

There are function-inverse pairs that have substantially different DFCs, e.g. 36
versus 52 or 40 versus 56. However, they tended to be the same on average. In
particular, 19.9 percent of the functions had higher cost than their inverses, 20.2 percent
had lower cost and 59.9 percent had the same cost. The experimental correlation
coefficient between the DFC of a function and its inverse was 0.90. We observed from
the detailed data that if one (or more) of the functions going one way was a projection
then the inverse functions would also include the same number of projections. This
was also true of complements of projections.

^6 This experiment was performed by John Langenderfer.


Rows: f Cost.  Columns: f^-1 Cost.

f Cost |    34    36    38    40    42    44    46    48    50    52    54    56    58    60    62    64
-------+------------------------------------------------------------------------------------------------
    34 |     0     0     0     0     1     0     0     0     0     0     0     0     0     0     0     0
    36 |     0     0     0     0     0     1     0     0     0     3     0     0     0     0     0     0
    38 |     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    40 |     0     0     0     3     0     1     0     1     0     0     0     0     0     3     0     3
    42 |     0     0     0     0     1     0     0     0     0     0     0     0     0     0     0     0
    44 |     0     2     0     2     0     3     0     6     0     5     0     3     0     5     0     9
    46 |     0     0     1     0     1     0     2     0     2     0     0     0     0     0     0     0
    48 |     0     0     0     2     0     4     0    41     0    20     0    32     0    74     0    60
    50 |     0     0     0     0     1     0     4     0    39     0     0     0     0     0     0     0
    52 |     0     0     0     1     0     9     0    28     0    93     0   124     0   229     0   306
    54 |     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    56 |     0     0     0     2     0     4     0    25     0   134     0   569     0  1057     0  2039
    58 |     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    60 |     0     0     0     0     0    13     0    57     0   248     0  1115     0  3180     0  5850
    62 |     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    64 |     0     0     0     0     0     5     0    55     0   309     0  2060     0  5907     0 25710

Table 6.37: Number of Functions and Inverses with a Given Cost Combination


Figure 6.19: Relationship Between Functions of a Given DFC and the Average DFC
of Their Inverses (axis label: DFC of f; plotted curve: Avg Cost)


6.6  Extrapolative Properties of Function Decomposition

6.6.1  Introduction 

It is possible to divide the algorithm design problem into the problem of defining what
function you want the algorithm to compute (the definition problem) and the problem
of getting a computer to compute the desired function with the given computing
resources (the realization problem; see [50]). For the PT 1 project, we deliberately set
aside the definition problem to focus on the realization problem. However, as an aside
we did some experiments on one approach to the definition problem. This section
describes the results of these experiments. Our Ada Function Decomposition (AFD)
program is set up such that it will decompose partial functions. A partial function is a
function with some of the outputs undefined or, equivalently, a function for which there
are inputs where we don’t care what the computer outputs. It turns out that by
recomposing a decomposed partial function you end up with a function that is “less
partial.” Therefore, recomposing decomposed partial functions is an approach to the
definition problem. The exploration of this approach will be the main topic in the
Pattern Theory 2 project.

How you approach the definition problem depends upon what you are given about
the problem. One of the most common forms for giving information about a function
is a set of samples. Sometimes you may also be given some information about how to
extrapolate these samples but not always. There are many ways to extrapolate samples
(Neural Nets, rule induction, fitting polynomials, nearest-neighbor, etc.); each
way is based upon some (often implicit) assumption about the form of the function.
How well a given approach works depends upon the validity of that assumption. The
traditional approaches to machine learning assume a relatively large amount about
the function and therefore have a relatively narrow range of successful applications.
Since the range of possible solutions is small, these traditional approaches tend to
require only a few samples. One of the most neglected aspects of machine learning
research is that of specifying exactly what a particular approach is assuming about
the function. Our approach assumes that the desired function has low computational
complexity. The central thesis of Pattern Theory (PT) is that a function has structure
(i.e. is patterned) if and only if it has low computational complexity. Therefore, the
Function Extrapolation by Recomposing Decompositions (FERD) approach to machine
learning assumes that the desired function is structured, but does not require
a specific kind of structure. We have shown that, while it is highly unlikely that an
arbitrary function will be structured, functions of interest in computing tend to be
structured. The FERD approach contrasts with the traditional approaches, which
assume not only that the desired function is structured, but that the function has some
specific structure (geometric, syntactic, etc.).

An important principle in science (the Principle of Parsimony or the idea of
Occam’s Razor) is that one should choose the simplest theory that is consistent with
the experimental results. FERD is basically using this same principle in its approach
to machine learning. With regard to human learning, there is much debate about the
roles of nature (phylogeny) and nurture (ontogeny). In terms of machine learning,
the “nature” may be characterized as the set of assumptions made by the machine
learning system and the “nurture” as the sample set (cf. Figure 2.1). Since FERD
assumes nothing about the specific structure of the function to be learned (only that
it is structured), knowledge learned by FERD has very little “nature” content. In the
trade-off between being able to learn lots of different kinds of things and being able
to learn quickly, FERD is unusual in that it sides with diversity.

6.6.2  FERD  Experiments 

A series of experiments was conducted to assess the performance of FERD. We took a
number of functions, sampled them, decomposed the sampled functions, recomposed
the decompositions, and compared the recomposed functions to the originals. The
“symmetric” function outputs one if and only if the input has exactly four 1’s in it.
The other functions are pretty well described by their names.
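
As a concrete illustration, three of the test functions can be written out as truth tables. This is a sketch under stated assumptions: the number of input variables is taken to be 7 (the text later mentions 128 total points for the parity function), the majority gate is assumed to output one when more than half of the inputs are 1, and the variable names are hypothetical.

```python
N = 7  # assumed number of binary input variables (2**7 = 128 points)


def ones(x):
    """Number of 1 bits in the input x."""
    return bin(x).count("1")


# Truth tables indexed by the input read as an N-bit integer.
parity = [ones(x) % 2 for x in range(2 ** N)]
majority = [1 if ones(x) > N // 2 else 0 for x in range(2 ** N)]  # assumed definition
symmetric = [1 if ones(x) == 4 else 0 for x in range(2 ** N)]     # exactly four 1's

print(sum(parity), sum(majority), sum(symmetric))
```

Under these assumptions the symmetric function has 35 ones out of 128 points, so the ones are its minority elements, whereas parity and majority are evenly balanced.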

We varied the number of samples; but, for each number of samples, we always
generated five randomly sampled versions of the original function. The sampled versions
of the function were then decomposed. The AFD output was then recomposed. The
recomposed functions were then compared to the original function. We computed the
average, maximum and minimum error for the five sampled versions of the original
function for each sample set size. If f is the original function and r the recomposed
function then the error $e_r$ is defined as

$$e_r = \sum_x e(x)$$

where $e(x) = |f(x) - r(x)|$ when r(x) is defined and e(x) = 1/2 when r(x) is
undefined. The results of these experiments are shown in Figures 6.20 through 6.30.
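
The error measure can be written directly as code. In this minimal sketch, functions are lists of 0/1 outputs and an undefined output of the recomposed function is represented by `None`; that representation and the name `extrapolation_error` are assumptions for illustration.

```python
def extrapolation_error(f, r):
    """Sum of per-point errors between the original f and the recomposed r.

    A defined point contributes |f(x) - r(x)|; a point where r is still
    undefined (None) contributes 1/2.
    """
    total = 0.0
    for fx, rx in zip(f, r):
        total += 0.5 if rx is None else abs(fx - rx)
    return total


# Hypothetical 4-point example: one remaining don't care, one wrong output.
print(extrapolation_error([0, 1, 1, 0], [0, 1, None, 1]))  # 0 + 0 + 0.5 + 1 = 1.5
```

Counting an undefined output as half an error matches the expected cost of guessing that output with a fair coin.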

Each  of  these  graphs  has  two  vertical  scales.  The  left  scale  is  the  number 
of  differences  between  the  original  function  and  the  recomposed  function.  Average, 
minimum  and  maximum  errors  are  plotted  relative  to  this  scale.  The  recomposed 
functions  may  still  have  some  “don’t  cares”  even  after  recomposition.  The  average 
number  of  don’t  cares  in  the  recomposed  function  is  plotted  relative  to  the  left  scale 
as  “Avg  D-Cares.”  The  curve  labeled  “Chance”  is  the  average  number  of  errors  that 
would  result  from  randomly  filling  in  the  don’t  cares  of  the  original  sampled  function. 
One  would  hope  that  a  machine  learning  system  would  do  better  than  chance.  The 
right  scale  is  the  Decomposed  Function  Cardinality  (DFC)  of  the  sampled  functions. 
The  average  (labeled  “P-Cost”),  minimum  and  maximum  DFC’s  are  plotted  relative 
to  this  scale. 
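
The “Chance” curve can be reproduced with a short calculation: if each don’t care is filled in with a fair coin flip, half of them are expected to be wrong. The sketch below assumes a 128-point function sampled without replacement; the helper names, the fixed seeds and the choice of parity as the target are illustrative assumptions rather than details of the original experiments.

```python
import random


def sample_function(f, s, rng):
    """Keep s randomly chosen outputs of f; mark the rest as don't cares (None)."""
    kept = set(rng.sample(range(len(f)), s))
    return [fx if x in kept else None for x, fx in enumerate(f)]


def random_fill(partial, rng):
    """Replace every don't care with a fair coin flip."""
    return [rng.randint(0, 1) if v is None else v for v in partial]


def expected_chance_errors(total_points, s):
    """Each of the (total_points - s) don't cares is wrong with probability 1/2."""
    return (total_points - s) / 2


def simulated_chance_errors(f, s, trials, rng):
    """Average number of wrong outputs over repeated sample-and-fill trials."""
    total = 0
    for _ in range(trials):
        guess = random_fill(sample_function(f, s, rng), rng)
        total += sum(abs(fx - gx) for fx, gx in zip(f, guess))
    return total / trials


# Hypothetical target: parity on 7 variables (128 points), 20 samples.
f = [bin(x).count("1") % 2 for x in range(128)]
print(expected_chance_errors(len(f), 20))
print(simulated_chance_errors(f, 20, 500, random.Random(0)))
```

The simulated average should land close to the closed-form value of (128 - 20) / 2 = 54 errors.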

Note  the  cost-error  relationship  of  these  curves.  There  do  not  seem  to  be  many 
cases  where  we  can  get  a  large  decrease  in  cost  for  a  small  increase  in  error.  If  this  is 
true  in  general  then  our  specialization  to  zero  error  realizations  (in  Chapter  3)  may 
not  be  as  great  a  loss  of  generality  as  we  thought. 


Figure 6.20: Learning Curve for XOR Function

Figure 6.21: Learning Curve for Parity Function

Figure 6.22: Learning Curve for Majority Gate Function


Figure 6.23: Learning Curve for a Random Function with Four Minority Elements

Figure 6.24: Learning Curve for the Symmetric Function

Figure 6.25: Learning Curve for Primality Test on Seven Variables


Figure 6.26: Learning Curve for Primality Test on Nine Variables

Figure 6.27: Learning Curve for a Random Function

Figure 6.28: Learning Curve for Font 1 “P”


Figure 6.29: Learning Curve for Font 1 “T”

Figure 6.30: Learning Curve for Font 0 “R”


Figure 6.31: Learning Examples for the Parity Function (left: the input function
with don’t cares; right: the recomposed function; rows: 20 samples, 108 don’t cares;
25 samples, 103 don’t cares; 30 samples, 98 don’t cares)

A sample of the results is also shown in picture form for the parity function and
the letter “R” in Figures 6.31 and 6.32. Figure 6.33, showing FERD results for
“P” and “T”, is the Pattern Theory logo. Figure 6.34 shows the error statistics for a
random function on each of 4 through 10 binary variables. Only one sampled version
of each original function for each sample set size was made for this graph.

The  results  of  these  experiments  were  very  consistent.  As  the  DFC  of  the  original 
function  goes  up,  the  number  of  samples  required  for  a  given  error  rate  goes  up. 
That  is,  complex  functions  are  harder  to  learn.  This  is  demonstrated  in  Figure  6.35, 
which  shows  the  minimum  number  of  samples  required  for  less  than  10  errors  as  a 
function  of  DFC.  For  random  functions,  FERD  did  no  better  than  chance.  Also,  the 
error  rate  tended  to  go  down  in  rough  proportion  to  the  increase  in  the  DFC  of  the 
learned function. Functions with relatively few minority elements (such as the
symmetric function, RND4ONES, and PRIMES 9) tended to have a plateau in their
error  curves.  A  random  selection  of  samples  would  contain  few  minority  elements. 
We  would  not  expect  to  see  this  plateau  if  we  had  balanced  the  samples  between 
function  elements  with  an  output  of  0  and  those  with  an  output  of  1. 


Figure 6.32: Learning Examples for the Letter “R”

Figure 6.33: The Pattern Theory Logo

Figure 6.34: Learning Curves for Random Functions on 4 Through 10 Variables
(horizontal axis: Fraction of Cares; one curve each for n = 4 through n = 10)

Figure 6.35: Number of Samples Required for < 10 errors

Figure 6.36: Neural Net Learning Curve for XOR Function

Figure 6.37: Neural Net Learning Curve for Parity Function

6.6.3  FERD and Neural Net Comparisons

We repeated some of the above experiments using a neural net (NN) rather than
FERD.^7 The neural net had three layers with n nodes in the first layer, 2n + 1 nodes
in  the  second  layer  and  1  node  in  the  final  layer.  The  weights  for  the  input  layer 
were  fixed  and  the  weights  for  the  other  layers  were  trained  using  back-propagation 
as  defined  in  [10,  pp.53-59].  These  results  are  shown  in  Figures  6.36  through  6.39. 
Measuring  the  performance  of  a  neural  net  is  not  as  black  and  white  as  in  the  FERD 
approach.  Neural  nets  have  a  number  of  user  specified  parameters  and  architectural 
features  that  affect  their  performance.  These  parameters  are  used  as  a  way  for  a 
human  to  get  more  “nature”  into  the  machine  to  get  it  started  learning.  Therefore,  it 
may  well  be  possible  to  fine  tune  some  neural  net  to  get  better  performance  than  we 

^7 Thomas Gearhart performed most of these experiments.

Figure 6.38: Neural Net Learning Curve for Majority Gate Function

Figure 6.39: Neural Net Learning Curve for the Symmetric Function


Number of samples:
               20          40          60          80         100         120        Total       Total %
Function     F     N     F     N     F     N     F     N     F     N     F     N     F     N     F      N
Symm        32    27    25    19    18    15     4    10     2     4     2     1    83    76   10.8    9.9
MajGate     37    46    26    26    12    11     0     6     0     3     0     0    75    92    9.8   12.0
Parity      64    63     0    46     0    31     0    18     0    11     0    15    64   184    8.3   24.0
XOR          4    23     0     1     0     1     0     0     0     0     0     0     4    25    0.5    3.3
Total      137   159    51    92    30    58     4    34     2    18     2    16   226   337    7.4   11.0

Table 6.38: FERD (F) and NN (N) Error Comparison

got here; but since FERD requires no fine tuning, we felt this was a fair comparison.
Table 6.38 compares the errors from the two approaches. For the Symmetric function
the neural net did slightly better than FERD (9.9 percent average error versus 10.8
percent); but on all other functions, the neural net did worse, sometimes much
worse. The lack of generality of the neural net approach is seen in its performance
on the parity function. Here the neural net did no better than chance while FERD
learned the function exactly with 30 samples (of 128 total points). Over all functions
and sample set sizes, the NN had an average error rate of 11 percent compared to
FERD’s 7.4 percent. Note that a random extrapolation of any function would result
in a composite average error rate of about 25 percent. FERD got the function exactly
right in 13 cases compared to 4 for the NN.

Another limitation of neural nets became apparent when we implemented a second
NN; this one had n nodes in the first layer, 10 nodes in the second layer, 5 nodes
in the third layer and 1 node in the final layer. The back-propagation implementation
of this NN was taken from [53]. The learning curves for this NN are shown
in Figure 6.40 and Figure 6.41. The points at 0 and 128 samples do not represent
actual data. Figure 6.40 demonstrates that this NN worked well for a step function.
Notice that although the first NN performed well on the Majority Gate function, this
second NN showed no consistency. This points out that selecting an architecture for
the net is important. However, this selection is left to the designer with no theory to
say what architecture is appropriate for a given class of functions. This contrasts with
the FERD approach, which “solves” for the architecture as a function of the data.

6.6.4  FERD  Theory 

We  were  surprised  that  the  Neural  Net  did  better  than  chance  and  even  more  surprised 
that  FERD  did  better  than  a  Neural  Net.  Our  best  theoretical  explanation  for  this  is 
based  on  the  probability  that  samples  have  a  certain  degree  of  structure.  Reference 
[33, p. 194] identifies Laplace and Kolmogorov as having recognized that regularity
consistent  with  a  simple  law  probably  is  a  result  of  that  simple  law. 

Figure 6.40: Second Neural Net’s Learning Curve for the Step Function

Figure 6.41: Second Neural Net’s Learning Curve for the Majority Gate Function

Let F be the set of all binary functions on n variables. Rather than use the simple
DFC, our cost measure is program length as defined in Section 4.3. This cost is
essentially DFC with something added to reflect the complexity of the interconnections
of the decomposition’s components. There is some function (f) that we want to learn
and we have a set of samples from f. There are two subsets of F that are of special
interest. One subset (S) is the set of all the functions consistent with the samples.
The other is the set (C) of all functions with a cost less than or equal to the cost of
f. The function f is in both these subsets and FERD is such that it always produces
a learned function (g) that is also in both these subsets.^8 That is, g, like f, is always
consistent with the samples and has a cost that does not exceed f’s. Therefore,
the size of the intersection of these two sets tells us something about how far g can
be from f. In particular, when the intersection only has one member, g = f (i.e.
FERD gets it exactly right). We use [A] to denote the cardinality of a set A. Let us
summarize our notation:

• F: the set of all functions of the form $f : \{0,1\}^n \rightarrow \{0,1\}$, $[F] = 2^{2^n}$.

• f: some function in F that we want to learn.

• c: the cost of f.

• s: the number of samples that we are given from f.

• S: the set of functions from F that are consistent with the samples. $[S] = 2^{2^n - s}$.

• C: the set of functions from F that have cost less than or equal to c. $[C] \le 2^c$
since we can name all the elements in C with c bits (see Theorem A.8).

• g: the function produced by FERD from the samples of f.
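
The size of the set of functions consistent with the samples can be checked by brute force for a small case. Each sample fixes one of the 2^n outputs and leaves the other 2^n - s outputs free, so there should be 2^(2^n - s) consistent functions. The sketch below enumerates all functions on n = 3 variables; the function name and the particular samples are hypothetical.

```python
from itertools import product


def consistent_functions(n, samples):
    """All truth tables on n variables that agree with the given samples.

    samples maps an input (as an integer index) to its known output.
    """
    points = 2 ** n
    return [table for table in product((0, 1), repeat=points)
            if all(table[x] == y for x, y in samples.items())]


n = 3
samples = {0: 1, 5: 0, 7: 1}  # s = 3 hypothetical samples
S = consistent_functions(n, samples)
print(len(S), 2 ** (2 ** n - len(samples)))
```

Both numbers printed are 32, that is, 2^(8 - 3). Exhaustive enumeration is only practical for very small n, since the number of functions grows as 2^(2^n).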

The hypergeometric probability distribution [41, pp. 175-176] is of interest here.
A random variable X has a hypergeometric distribution with parameters N, n and r if

$$P(X = k) = \frac{\binom{r}{k}\binom{N-r}{n-k}}{\binom{N}{n}}, \qquad k = 0, 1, 2, \ldots$$

Meyers’70 gives the following useful properties of the hypergeometric distribution.
Let p = r/N and q = 1 − p:

• E(X) = np,

• V(X) = npq(N − n)/(N − 1),

• $P(X = k) \approx \binom{n}{k} p^k (1-p)^{n-k}$ for large N.


^8 Figures 6.21 and 6.29 have points where the cost of an extrapolated function exceeds
that of the original function. This is a result of the non-optimal nature of the software
used and should not happen in principle.


The  last  property  indicates  that  the  hypergeometric  distribution  is  well  approximated 
by  the  binomial  distribution  when  N  is  large.  A  binomial  distribution  with  the  above 
parameters  has  expected  value  np  and  variance  npq. 
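
These properties are easy to check numerically. The sketch below evaluates the hypergeometric probabilities with exact binomial coefficients and compares them with the binomial approximation; the function names and the particular parameter values are assumptions chosen for illustration.

```python
from math import comb


def hypergeom_pmf(k, N, n, r):
    """P(X = k) for a hypergeometric distribution with parameters N, n and r."""
    return comb(r, k) * comb(N - r, n - k) / comb(N, n)


def binom_pmf(k, n, p):
    """P(X = k) for a binomial distribution with n trials and success probability p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


# Small case: check the normalization and the mean E(X) = n * r / N.
N, n, r = 50, 10, 20
probs = [hypergeom_pmf(k, N, n, r) for k in range(n + 1)]
mean = sum(k * pk for k, pk in enumerate(probs))
print(sum(probs), mean, n * r / N)

# Large N: the binomial with p = r / N is a close approximation.
big_N, big_r = 10000, 3000
print(hypergeom_pmf(3, big_N, n, big_r), binom_pmf(3, n, big_r / big_N))
```

For the small case the probabilities sum to one and the mean equals np = 4; for N = 10000 the two probabilities agree to about three decimal places.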

Given two subsets A and B of a universal set U, if A and B are randomly and
independently selected from U then the number of common elements has the
hypergeometric distribution with parameters [U], [A] and [B]; that is, for X the size of
A ∩ B,

$$P(X = k) = \frac{\binom{[A]}{k}\binom{[U]-[A]}{[B]-k}}{\binom{[U]}{[B]}}, \qquad k = 0, 1, 2, \ldots$$

Now let p = [A]/[U] and q = 1 − p:

• E(X) = [B]p,

• V(X) = [B]pq([U] − [B])/([U] − 1).

Returning now to our problem and our notation, if C and S are independently
and randomly selected subsets of F then the cardinality of C intersect S has a
hypergeometric distribution.^9 That is, for $X = [S \cap C]$,

$$P(X = k) = \frac{\binom{[S]}{k}\binom{[F]-[S]}{[C]-k}}{\binom{[F]}{[C]}}, \qquad k = 0, 1, 2, \ldots$$

Now let $p = [S]/[F]$ and $q = 1 - p$:

• $E(X) = [C]p = [C]\,2^{2^n - s}/2^{2^n} \le 2^c\,2^{2^n - s}/2^{2^n} = 2^{c-s}$,

• $V(X) = [C]pq([F] - [C])/([F] - 1) \approx [C]pq \le 2^{c-s}(1 - 2^{-s})$ for large [F], and
$V(X) \approx 2^{c-s}$ for large s.

Therefore, when s > c we expect C and S to share only one function, namely f
(which in this case must equal g).
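To get a feel for how quickly the expected overlap shrinks as samples are added, the bound on E(X) can be evaluated directly. This is only a sketch under the independence assumption stated above; the function name `expected_intersection_bound` and the example values of n, c and s are assumptions made for illustration.

```python
from fractions import Fraction

def expected_intersection_bound(n, c, s):
    """Upper bound on E[|C intersect S|], using [C] <= 2^c, [S] = 2^(2^n - s), [F] = 2^(2^n)."""
    if not 0 <= s <= 2 ** n:
        raise ValueError("s must lie between 0 and 2^n")
    size_F = 2 ** (2 ** n)
    size_S = 2 ** (2 ** n - s)
    max_size_C = 2 ** c
    return Fraction(max_size_C * size_S, size_F)

# Illustrative values: functions on n = 4 variables (16 points each).
few_samples = expected_intersection_bound(n=4, c=10, s=6)   # s < c
many_samples = expected_intersection_bound(n=4, c=6, s=10)  # s > c
```

With s < c the bound exceeds one, so several low-cost candidates are expected to survive; with s > c it drops below one, which matches the observation that only f itself is expected to remain.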

When s < c we would expect f and g to be in the same relatively small subset.
This means that there is some chance of g equalling f. However, we cannot explain
why, when g is not exactly f, g tends to be closer to f than chance. Apparently the
cost neighborhood of a function and the geometric neighborhood of a function are
related (i.e. they overlap by more than chance).^10

^9 Mike Breen recognized the relationship between this problem and the hypergeometric
distribution.

^10 The parity function is an exception to this. That is, for the parity function, the error is worse
than chance until there are sufficient samples to extrapolate the function exactly. Also, the Neural
Net had trouble with the parity function.




It  seems  that  S  is  controlled  by  “geometric”  distance  while  C  is  controlled  by 
cost  differences.  What  is  the  relationship  between  cost  and  geometric  distance  and 
how  does  this  relate  to  FERD?  We  first  develop  an  upper  bound  on  cost  distance 
as  a  function  of  geometric  distance.  Second,  we  draw  some  conclusions  about  the 
“topology”  of  the  space  of  functions  with  respect  to  cost.  Finally,  we  re-interpret 
some  earlier  experimental  results  in  a  way  that  applies  to  this  problem. 

We can relate the geometric and cost distance between two functions by, in effect,
using the computation of one function to compute the other. That is, we compute the
more costly function by first computing the less costly function and then computing
the differences between the two functions. Define the geometric distance ($d_g$) between
two Boolean functions (f and g) on n variables as:

$$d_g(f, g) = \sum_{x \in \{0,1\}^n} |f(x) - g(x)|.$$

This is equivalent to the number of points in which f and g differ. Define the cost
distance ($d_c$) between f and g as:

$$d_c(f, g) = |DFC(f) - DFC(g)|,$$

where $DFC(f)$ means the Decomposed Function Cardinality of f.
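The geometric distance is simply a Hamming distance between truth tables, which a short sketch makes concrete. The helper names `truth_table` and `geometric_distance` and the two example functions are assumptions introduced here for illustration; the cost distance is not shown because it requires a DFC computation (the AFD program), which is outside the scope of this sketch.

```python
from itertools import product

def truth_table(func, n):
    """Tabulate a Boolean function of n variables as a tuple of 2^n output bits."""
    return tuple(func(*bits) for bits in product((0, 1), repeat=n))

def geometric_distance(f_table, g_table):
    """d_g: the number of input points on which the two functions differ."""
    if len(f_table) != len(g_table):
        raise ValueError("functions must be defined on the same number of points")
    return sum(abs(a - b) for a, b in zip(f_table, g_table))

# Two example functions on n = 3 variables (chosen only for illustration).
and3 = truth_table(lambda a, b, c: a & b & c, 3)
maj3 = truth_table(lambda a, b, c: int(a + b + c >= 2), 3)
distance = geometric_distance(and3, maj3)
```

Here the three-input AND and the three-input majority agree everywhere except on the three inputs with exactly two ones, so their geometric distance is 3.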

Theorem 6.9 If $d_g(f, g) \le p$ then $d_c(f, g) \le 4np$.

Proof:

Assume that $d_g(f, g) \le p$. Suppose f is the less expensive of the two functions. We
can compute g by first computing f (with cost $DFC(f)$), computing the p minority
elements where f and g differ (with cost $4(n - 1)$ each, or a total of $4p(n - 1)$) and
then summing f and the p minority elements together (with cost $4(k - 1)$, where
$k = p + 1$ is the number of variables to be summed, for a total cost of $4p$). The DFC
of g therefore must not be greater than the cost of the above computation, which is
$DFC(f) + 4p(n - 1) + 4p = DFC(f) + 4np$. That is, $DFC(g) \le DFC(f) + 4np$.
□
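As a worked instance of the cost accounting in this proof, take n = 8 variables and p = 3 differing points (values chosen only for illustration):

```latex
\begin{align*}
\underbrace{4p(n-1)}_{\text{minority elements}} + \underbrace{4p}_{\text{summation}} &= 4np, \\
n = 8,\ p = 3:\qquad 4 \cdot 3 \cdot 7 + 4 \cdot 3 &= 84 + 12 = 96 = 4 \cdot 8 \cdot 3 .
\end{align*}
```

So two functions on 8 variables that differ on no more than 3 points have DFCs that differ by no more than 96.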

Theorem 6.10 If $d_g(f, g) \ge 2^n - p$ then $d_c(f, g) \le 4np + 2$.

Proof:

Note that $d_g(f, g) \ge 2^n - p$ implies $d_g(f, \mathrm{not}(g)) \le p$, since if f and g disagree on all
but p points then f and not(g) agree on all but p points. We can compute not(g) as
above and then complement it with a cost of 2.

□ 
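The step this proof relies on is the identity d_g(f, not(g)) = 2^n - d_g(f, g). The following sketch checks it exhaustively for very small n; the names `d_g`, `complement` and `complement_identity_holds` are hypothetical helpers written for this illustration.

```python
from itertools import product

def d_g(f_table, g_table):
    """Geometric distance: number of points where the two truth tables differ."""
    return sum(abs(a - b) for a, b in zip(f_table, g_table))

def complement(table):
    """Truth table of not(g)."""
    return tuple(1 - bit for bit in table)

def complement_identity_holds(n):
    """Check d_g(f, not g) == 2^n - d_g(f, g) for every pair of functions on n variables."""
    points = 2 ** n
    tables = list(product((0, 1), repeat=points))
    return all(
        d_g(f, complement(g)) == points - d_g(f, g)
        for f in tables
        for g in tables
    )
```

The exhaustive check is only practical for tiny n, since the number of function pairs grows as 2^(2^(n+1)).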


If neither f nor g has cost $\le 2$ then Theorem 6.10 has the slightly stronger form
of Theorem 6.9: if $d_g(f, g) \ge [f] - p$ then $d_c(f, g) \le 4np$, since the cost of a function
with cost more than 2 is the same as that of its complement.




Theorem  6.9  tells  us  that  functions  within  a  local  geometric  neighborhood  are 
also  within  a  local  cost  neighborhood.  The  converse  of  Theorem  6.9  is  not  true. 
Theorem  6.10  points  out  a  certain  kind  of  symmetry  in  the  cost  of  functions  as  you 
go  from  one  point  in  the  dg  space  of  functions  to  its  opposite  point.  The  relationship 
between  dg  and  dc  is  analogous  to  the  temperature  distribution  on  the  earth.  The 
temperature everywhere inside this county is within a degree or two of the temperature
here. However, when you consider the places that have temperatures about the same
as here, you might have to include places as far away as Europe or Asia. We think that
the set of low cost functions forms many small separated regions within the geometric
space  of  functions.  We  can  think  of  the  learning  situation  as  one  where  S  specifies 
a  fairly  large  connected  geometric  region  and  C  specifies  a  bunch  of  small  separate 
geometric  regions.  The  geometric  regions  formed  by  S  are  “sub-spaces.”  For  example, 
if  F  were  3-D  then  S  might  be  a  plane  or  a  line.  The  intersection  of  S  and  C  then 
might  typically  be  just  a  sub-space  of  one  of  these  small  pockets  of  low  cost.  FERD, 
in  effect,  picks  one  of  the  lowest  cost  elements  from  S  intersect  C.  Looking  at  the 
learning  curves,  we  see  that  as  more  and  more  samples  are  provided  the  minimum 
cost  function  has  higher  and  higher  cost.  Also,  when  the  cost  of  the  minimum  cost 
function gets up to the cost of f, f is almost always the function FERD selects. It
seems  the  only  way  there  would  be  more  than  one  function  of  minimum  cost  is  if  the 
sub-space  defined  by  S  runs  along  a  constant  cost  line  (surface  or  whatever)  rather 
than  crossing  it.  Apparently  the  odds  of  that  happening  are  not  great. 

The expected size of S intersect C is at most $2^{c-s}$. This result depends upon the
assumption that S and C are randomly and independently selected from F. Let us examine
that assumption. Independence requires the distribution of functions with respect to
cost within S to be the same as within F as a whole (as in $P(C \mid S) = P(C)$). The
first part of this section showed that within a small $d_g$ neighborhood independence
does not hold. However, S is not small and it is not really a neighborhood either. We
know that all the functions in S are the same on the sample set, so their $d_g$ is no more
than $[f] - s$. However, in real world learning applications, s is very small compared to
[f]. That is, S is very large. For any function $f_1$ in S there exists a function $f_2$, not in
S, such that $d_g(f_1, f_2) = 1$; that is, they differ only on some point in the sample set.
That is, S is not what we would think of as a neighborhood. Therefore, Theorem 6.9
does not preclude independence between C and S for realistic situations.
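The structural claims about S in this paragraph can be checked by brute force on a toy case. The sketch below is an assumption-laden illustration: it uses n = 3, takes the three-input majority function as the unknown f, and picks five sample points arbitrarily; none of these choices come from the report.

```python
from itertools import product

N_VARS = 3
POINTS = 2 ** N_VARS                    # [f] = 2^n = 8 points per truth table
TARGET = (0, 0, 0, 1, 0, 1, 1, 1)       # 3-input majority, standing in for the unknown f
SAMPLE_IDX = (0, 1, 2, 4, 7)            # s = 5 sampled points (arbitrary choice)

def consistent(table):
    """True if the table agrees with the target on every sampled point."""
    return all(table[i] == TARGET[i] for i in SAMPLE_IDX)

def d_g(a, b):
    """Geometric distance between two truth tables."""
    return sum(abs(x - y) for x, y in zip(a, b))

def flip(table, i):
    """Copy of the table with the output at point i complemented."""
    return table[:i] + (1 - table[i],) + table[i + 1:]

ALL_FUNCTIONS = list(product((0, 1), repeat=POINTS))   # F, with [F] = 2^(2^n) = 256
S = [t for t in ALL_FUNCTIONS if consistent(t)]

size_S = len(S)                                        # expected: 2^(2^n - s)
max_spread = max(d_g(a, b) for a in S for b in S)      # expected: at most 2^n - s
leaves_S = all(not consistent(flip(t, SAMPLE_IDX[0])) for t in S)
```

Every member of S has a neighbor at distance 1 that lies outside S (flip any sampled point), while members of S can be as far apart as the number of unsampled points, which is why S behaves like a sub-space rather than a neighborhood.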

There is some evidence suggesting that S and C are sufficiently independent. This
evidence comes from the minority element experiments (Section 6.2.3). We studied
the relationship between the number of minority elements in a function and the
function’s DFC. It appears that if the number of minority elements is greater than
about 20 percent of the size of f then the cost distribution is random (at least the
average DFC is about the same as a random distribution). The number of minority
elements in a function is the same as its $d_g$ from a constant function. Also, since the
DFC of a constant function is zero, the DFC of a function is the same as its $d_c$ from a
constant function. Therefore, the DFC versus number-of-minority-elements data can
be thought of as $d_g$ versus $d_c$ data relative to a constant function. If f is a constant
function and s is less than 60 percent of [f] then S looks pretty much like the set of
random functions with greater than 20 percent minority elements. Therefore, if s is
less than 60 percent of [f], which it almost always will be, then the cost distribution
on S is pretty much random. This means that S and C are reasonably independent
under the expected conditions. Intuitively, you would think that you would have
to get further from a constant function than any other function before things look
random. Therefore, this conclusion that S has a random cost distribution relative to
constant functions is even more likely to be true for other functions.

6.6.5  Summary 

In  summary,  FERD  is  an  approach  to  function  extrapolation  that  chooses  the  least 
computationally  complex  function  that  is  consistent  with  the  samples.  In  effect, 
FERD chooses a function from the intersection of the set of low cost functions (C)
and the set of functions consistent with the samples (S). Under an assumption of
independence of C and S, the expected value of $[C \cap S]$ is at most $2^{c-s}$. We see some evidence
that  this  assumption  may  not  be  too  bad  in  practice.  Anecdotal  comparisons  of  FERD 
and  Neural  Nets  suggest  that  FERD  has  considerable  potential  as  an  extrapolation 
method.  PT  2  will  focus  on  FERD. 

6.7  Summary 

This  chapter  reports  on  the  results  of  looking  at  the  world  from  a  Pattern  Theory 
perspective.  We  found  that  randomly  generated  functions  have  high  DFC  and  that 
a  very  wide  range  of  patterned  functions  (numeric,  symbolic,  string,  graphic,  images 
and  files)  have  low  DFC.  There  is  high  correlation  between  pattern-ness  as  measured 
by  DFC  and  as  perceived  by  people.  There  is  also  high  correlation  between  DFC 
and  the  compression  factor  for  files.  These  results  support  the  contention  that  DFC 
is a measure of the essential pattern-ness of a function. We also found that function
decomposition  has  remarkably  general  extrapolative  properties.  This  further  supports 
the  idea  of  a  fundamental  importance  of  decomposition. 




Chapter  7 


Conclusions  and 
Recommendations 


The theory of computing, as taught in most universities, fails the engineer in several
ways. First, it makes much ado over something that is practically trivial, i.e.
that there exist infinite functions that cannot be computed with finite machines
(Appendix A). Second, the treatment of complexity results in a statement that is
practically not acceptable, i.e. that all finite functions have the same complexity ($O(1)$
time complexity) (Section 4.4). Finally, and most importantly, the theory does not
support design.

This report introduces a computing paradigm that we hope will support the
engineering requirements. Note that most of its elements (the decomposition condition
[4], the relationship between combinational and time complexity [64], the theory of
program length [12]) are known in the computing theory community, but lack the
deserved  emphasis.  If  this  report  offers  anything  new  to  computing  theory  it  is  the 
idea  of  Decomposed  Function  Cardinality  (DFC)  as  a  general  measure  of  essential 
computational  complexity.  DFC  correlates  very  well  with  the  intuitive  expectations 
for  a  general  measure  of  pattern-ness.  We  ran  the  AFD  program  on  a  wide  variety 
of patterned functions. Recall that, with high probability, randomly generated functions
do not decompose. Therefore, it is quite remarkable that of the roughly 850 functions
that we did not generate randomly, only about 20 did not decompose. We feel
that  the  generality  of  DFC  as  a  measure  of  pattern-ness  has  been  well  demonstrated. 

We  can  also  start  to  see  the  potential  for  practical  benefits  from  Pattern  Theory. 
We  have  been  working  with  what  might  be  called  “toy”  problems  (i.e.  binary  functions 
on  no  more  than  10  or  so  variables).  However,  we  believe  that  we  should  expect  to  be 
able  to  do  algorithm  design  on  toy  problems  before  more  general  problems.  Note  that 
this  is  opposite  from  the  approach  taken  in  most  machine  learning  paradigms.  The 
common view in machine learning is that we solve the easy problems by hand and,
when we get to problems that we cannot solve by hand, such as character recognition
(cannot solve the definition problem) or the traveling salesman problem (cannot
solve the realization problem), we try to get a machine to solve them. We think that
we  should  be  able  to  machine  learn  (or  machine  design  algorithms  for)  easy  problems 


long  before  we  can  machine  learn  the  hard  ones.  Each  time  the  AFD  program  found 
that a function had a low DFC (which it did some 830 times), it also found an
algorithm for that function. Therefore, although we were limited to toy problems, we
have  demonstrated  machine  designed  algorithms  in  a  very  general  setting.  From  a 
machine  learning  perspective,  FERD  performed  as  well  as  a  Neural  Net  on  functions 
well  suited  to  Neural  Nets  and  considerably  outperformed  the  Neural  Net  on  other 
functions. Also, unlike most traditional machine learning paradigms, we can say
exactly what property a function must have to be amenable to FERD extrapolation.
That  property  is  low  computational  complexity  in  the  sense  of  DFC.  Recall  that  DFC 
is  a  strikingly  general  sense  of  complexity  and  we  begin  to  appreciate  the  significance 
of  FERD.  Similarly,  from  a  data  compression  perspective,  function  decomposition 
performed  comparably  to  hand-crafted  compression  algorithms  on  typical  files  and 
considerably  outperformed  them  on  atypical  files.  Therefore,  we  feel  that  Pattern 
Theory  represents  a  solid  foundation  for  an  engineering  theory  of  computing. 

The results of PT 1 clearly support the need for further exploration and development
of Pattern Theory. Although PT 1 dealt with toy problems, we went from a
general  statement  of  the  real  problems  to  the  PT  1  problem  in  a  series  of  deliberate 
partitions  and  simplifying  assumptions.  Therefore,  our  recommendations  for  future 
developments  of  Pattern  Theory  are  simply  to  revisit  each  decision  in  defining  the 
PT  1  problem  and  consider  the  implications  of  not  making  that  decision. 




Chapter  8 
Summary 


Algorithm  technologies  are  very  important  to  the  Avionics  Directorate  because  they 
offer  a  cost-effective  means  to  realize  adaptive  and  fault  tolerant  avionics  capabilities. 
There  is  an  un-met  need  for  a  theory  of  algorithm  design.  We  only  have  to  consider 
the  power  of  Estimation  Theory  or  Control  Theory  to  realize  the  importance  of  an 
engineering  theory.  This  report  proposes  and  discusses  “Pattern  Theory”  as  a  basis 
for  an  engineering  theory  of  algorithm  design. 

Pattern  Theory  (PT)  begins  with  a  very  general  statement  of  the  algorithm  design 
problem.  This  general  problem  covers  virtually  every  other  discipline  that  results  in 
a  computational  system,  including  Neural  Networks  (NN)  and  model-based  vision. 
Deliberate  specializations  to  this  general  problem  are  made  to  arrive  at  the  problem 
treated  in  this  report  (the  PT  1  problem).  It  is  important  (and  unusual)  to  state 
up-front  how  a  specialized  approach  relates  to  the  general  problem. 

The  problem  of  finding  a  pattern  in  a  function  is  the  essence  of  algorithm  design. 
The  key  to  PT  is  its  measure  of  pattern-ness:  Decomposed  Function  Cardinality 
(DFC).  The  thesis  is  that  low  DFC  indicates  pattern-ness.  This  measure  is  uniquely 
general in reflecting the low complexity that is associated with patterns. The
properties of DFC are defined and developed with mathematical rigor. The principal result
of  PT  1  is  a  demonstration  of  the  generality  with  which  DFC  measures  pattern-ness. 

The  generality  of  DFC  is  supported  theoretically  by  57  proven  theorems.  DFC 
has  the  property  that  if  a  function  is  computationally  non-complex  relative  to  time 
complexity  or  circuit  complexity  then  it  is  necessarily  non-complex  with  respect  to 
DFC,  while  the  converse  is  not  necessarily  true.  The  previous  statement  is  not  true 
relative  to  program  length  since  one  can  always  assign  an  arbitrarily  short  program 
to  any  function.  However,  DFC  has  the  property  that  its  average  is  no  more  than 
a  single  bit  greater  than  the  average  program  length  and,  when  decompositions  are 
encoded  as  programs,  the  program  length  cannot  be  large  without  DFC  also  being 
large.  By  rigorously  relating  DFC  to  time  complexity,  program  length  and  circuit 
complexity  it  can  be  assured  that  the  class  of  patterns  defined  by  DFC  includes  the 
classes  of  patterns  that  would  be  defined  by  these  conventional  measures.  In  doing 
this,  new  results  were  also  developed  in  the  theory  of  program  length  (Appendix  A). 

PT 1 explored function decomposition algorithms as a means to find DFC patterns.
We developed a general test (the Basic Decomposition Condition), based on
DFC, for whether or not a function will decompose for a given partition of its
variables. This test was then used in algorithms (collectively known as the Ada Function
Decomposition  (AFD)  algorithms)  with  a  variety  of  heuristics  for  limiting  the  depth 
of  the  search  through  iterative  decompositions.  AFD  produces  a  decomposition  (i.e. 
an  algorithm  in  combinational  form)  and  the  DFC  of  a  function.  We  did  not  find 
the  hoped  for  threshold  where  additional  searching  had  no  payoff.  However,  we  did 
find  that  the  most  restrictive  search  method  came  very  close  to  performing  as  well  as 
the  least  restrictive  search  method,  despite  the  several  orders  of  magnitude  greater 
run-time  of  the  less  restricted  search.  There  was  especially  little  benefit  in  the  larger 
search  when  the  function  being  decomposed  was  highly  patterned. 

The  generality  of  DFC  was  also  supported  experimentally.  Over  800  non-randomly 
generated  functions  were  tested  including  many  kinds  of  functions  (numeric,  symbolic, 
string  based,  graph  based,  images  and  files).  Roughly  98  percent  of  the  non-randomly 
generated  functions  had  low  DFC  (versus  less  than  1  percent  for  random  functions). 
The  2  percent  that  did  not  decompose  were  the  more  complex  of  the  non-randomly 
generated  functions  rather  than  some  class  of  low  complexity  that  AFD  could  not 
deal  with.  It  is  important  to  note  that  when  AFD  says  the  DFC  is  low,  which  it  did 
some  800  times,  it  also  provides  an  algorithm.  AFD  found  the  classical  algorithms 
for  a  number  of  functions. 

Some applications demonstrate the importance of DFC’s generality. The correlation
coefficient between DFC and a ranking of patterns by people was 0.8. In a data
compression application on typical files, the correlation coefficient between DFC and
the compression factor of two commercial data compression programs was about 0.9.
However, on an atypical file, AFD had the much better compression factor of 0.04
versus 0.86 and 0.94 for the commercial programs. In a machine learning application,
AFD did as well as a NN on problems well suited to NN’s. However, on another
problem, AFD learned a 128 point function from 30 samples whereas the NN required
all 128 points. These applications demonstrate that traditional paradigms look for a
particular kind of pattern; when they find that pattern they do well and when they
don’t they do poorly. PT is unique in that it does not look for a particular
kind of pattern; it looks for patterns in general.

PT  is  a  pervasive  technology  and,  at  maturity,  may  revolutionize  the  approach  to 
many  computational  problems.  Potential  areas  for  early  application  of  PT  have  been 
identified,  such  as  algorithm  development  for  Non-Cooperative  Target  Identification. 
PT has also shown promise in machine learning and data compression problems.
Although PT 1 has laid the foundation and even identified potential early applications,
there  are  many  unsolved  problems.  PT  2  will  begin  to  address  the  issues  that  limit 
the  application  of  decomposition  to  small  problems. 




Appendix  A 

Program  Length  and  the 
Combinatorial  Implications  for 
Computing 


A.1 Mathematical Preliminaries
A.1.1 Basic Definitions

A partial function $f : X \to Y$ is a set of ordered pairs from $X \times Y$, where there
is at most one $y \in Y$ associated by f with any $x \in X$. That is, $(x, y) \in f$ and
$(x, y') \in f$ implies that $y = y'$. A total function is a partial function with the
additional property that f is defined on all of X. That is, for all $x \in X$, there exists a
$y \in Y$ such that $(x, y) \in f$. We will also use the word “mapping” for a function. The
set X is called the domain of f. The set Y is called the codomain of f. The first
elements of all the pairs in f form a set called the base (this is not standard). The
base of f is a subset of the domain of f. If f is total then the base of f equals the
domain of f. The second elements of all the pairs of f form a set called the range.
The range is a subset of the codomain. The image of an element x in the domain
of f is the corresponding element $y = f(x)$ in the range of f. When the range of a
mapping equals the codomain of the mapping, the mapping is said to be surjective
or “onto.” Number theoretic functions are functions of the form $f : N^n \to N$,
where N is the set of natural numbers.

The cardinality of a set A, denoted [A], is the number of elements in the set A if
A is a finite set. The cardinality of an infinite set is determined by which standard set
it corresponds to in a one-to-one manner. The cardinals associated with infinite sets
are called transfinite cardinals. Since a function f is also a set, [f] is well-defined
and indicates the size of a function. Let c indicate the cardinality of the continuum
(i.e. the real numbers) and $\aleph_0$ indicate the cardinality of a countably infinite set.
We know that $n^{\aleph_0}$ equals c for any finite n greater than one [20, p.155 Equation 2].
In general, a finite number to the power of a transfinite cardinal is the next larger
cardinal. We have a particular interest in the cardinality of sets involving the Natural
numbers (N). All elements of N are finite. However, N is unbounded. That is, there
does not exist a $b \in N$ such that $n \le b$ for all $n \in N$. Therefore, N has cardinality
$\aleph_0$ and a function with domain N also has cardinality $\aleph_0$. The set of all possible
functions with domain N has cardinality c [59, p.12]. The relationship “<” can be
interpreted in a natural way on N and on transfinite numbers; i.e. $n < \aleph_0 < c$ for all
$n \in N$. That is, the elements of N have their usual order, the transfinite cardinals
have their usual order and all transfinite cardinals are greater than all finite cardinals.
The relationship “=” also has a natural interpretation, so “≤” can be used as well.

Let $\Sigma$ be a set of symbols, not necessarily of finite cardinality. Let Q be N or
a subset of N of the form $\{n \in N \mid n < T\}$ for some $T \in N$. A string on $\Sigma$ is a
mapping from Q into $\Sigma$ for some Q. Define $\Sigma^*$ as the set of all possible strings on $\Sigma$.
For x, a string in $\Sigma^*$, we denote the length of x by $s(x)$. We have not adopted the
traditional finite limitation on $[\Sigma]$ and $s(x)$. A language over $\Sigma$ is a subset of $\Sigma^*$.

Let $X_1, X_2, \ldots, X_n$ be arbitrary sets. An element of a product of sets (e.g. $X_1 \times
X_2 \times \cdots \times X_n$) is a vector. A vector is also a string with alphabet $\Sigma = X_1 \cup X_2 \cup
\cdots \cup X_n$. Therefore, we designate the length of a vector $s(x)$ as its length as a string.
For a vector x in $X_1 \times X_2 \times \cdots \times X_n$, $s(x) = n$.

A.1.2  Combinatorics 

Let  us  review  some  basic  combinatorics.  We  are  especially  interested  in  the  cardinality 
of  the  set  of  all  vectors  of  a  given  length,  the  set  of  all  strings  no  longer  than  a  given 
length,  and  the  set  of  all  functions  on  a  given  domain.  The  following  theorem  gives 
the  cardinality  of  a  set  of  vectors. 

Theorem A.1 If V is the set of all vectors of length $s(v) = n$, that is $V = \Sigma^n$,
then there are $[\Sigma]^{s(v)}$ vectors in V, i.e. $[V] = [\Sigma]^{s(v)}$.

Proof:

There can be any one of $[\Sigma]$ symbols in the first position. Then for each symbol in
the first position, there can be any one of $[\Sigma]$ symbols in the second position. Thus,
the number of combinations for a two-dimensional vector is $[\Sigma][\Sigma] = [\Sigma]^2$. Similarly,
for each combination of the first two symbols there can be any one of $[\Sigma]$ symbols in
the third position. Thus, the number of combinations for a three-dimensional vector
is $[\Sigma]^2[\Sigma] = [\Sigma]^3$. This argument can be continued to find that there are $[\Sigma]^{s(v)}$
combinations of symbols in an $s(v)$-dimensional vector.

□ 
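Theorem A.1 can be confirmed by direct enumeration for a small alphabet. This is a minimal sketch; the function name `count_vectors` and the three-letter alphabet are assumptions chosen for illustration.

```python
from itertools import product

def count_vectors(alphabet, length):
    """Count all vectors of the given length over the alphabet by listing them."""
    return sum(1 for _ in product(alphabet, repeat=length))

# Illustrative alphabet with 3 symbols and vectors of length 4.
alphabet = "abc"
enumerated = count_vectors(alphabet, 4)
predicted = len(alphabet) ** 4
```

`itertools.product` with `repeat=length` generates exactly the position-by-position choices used in the proof, so the enumerated count and the power formula must agree.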


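We now develop the cardinality of a set of strings on an alphabet $\Sigma$ of $[\Sigma]$ letters.

Theorem A.2 If L is the set of strings on $\Sigma$ whose lengths are less than or equal to
some threshold $s(l')$ then $[L] = ([\Sigma]^{s(l')+1} - 1)/([\Sigma] - 1)$.

Proof:

L is a subset of $\Sigma^*$. Let $l'$ be a string in L of the maximum allowed length. There
are strings of length $0, 1, 2, \ldots, s(l')$. For each length i there are $[\Sigma]^i$ different
strings (as with vectors). Therefore, there are a total of

$$\sum_{i=0}^{s(l')} [\Sigma]^i$$

strings in L. Thus,

$$[L] = \sum_{i=0}^{s(l')} [\Sigma]^i
= \frac{[\Sigma]\sum_{i=0}^{s(l')} [\Sigma]^i - \sum_{i=0}^{s(l')} [\Sigma]^i}{[\Sigma] - 1}
= \frac{\sum_{i=1}^{s(l')} [\Sigma]^i + [\Sigma]^{s(l')+1} - [\Sigma]^0 - \sum_{i=1}^{s(l')} [\Sigma]^i}{[\Sigma] - 1}
= \frac{[\Sigma]^{s(l')+1} - 1}{[\Sigma] - 1}.$$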

□ 
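The closed form of Theorem A.2 can be compared against a direct count of all strings up to a given length. The sketch below assumes an alphabet with at least two symbols (the formula divides by the alphabet size minus one); the names `count_strings_up_to` and `closed_form` are illustrative.

```python
from itertools import product

def count_strings_up_to(alphabet, max_len):
    """Count every string over the alphabet with length 0, 1, ..., max_len."""
    return sum(
        1
        for length in range(max_len + 1)
        for _ in product(alphabet, repeat=length)
    )

def closed_form(alphabet_size, max_len):
    """Geometric-series total; requires alphabet_size >= 2."""
    if alphabet_size < 2:
        raise ValueError("closed form needs at least two symbols")
    return (alphabet_size ** (max_len + 1) - 1) // (alphabet_size - 1)

enumerated = count_strings_up_to("abc", 4)   # 1 + 3 + 9 + 27 + 81
predicted = closed_form(3, 4)
```

The count includes the empty string (length 0), which is why the sum starts at i = 0.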

The  expression  for  the  number  of  vectors  is  simpler  than  the  expression  for  the 
number  of  strings,  yet  there  is  not  much  difference  between  the  two. 

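Theorem A.3 For finite $s(l')$ and $s(l') = s(v)$, we have $[V] < [L] < 2[V]$.

Proof:

The first inequality follows from V being a proper subset of L. The second inequality
can be developed as follows:

$$[\Sigma]^{s(l')+1} > [\Sigma]^{s(l')+1} - 1 = [\Sigma][\Sigma]^{s(l')} - 1 \ge 2[\Sigma]^{s(l')} - 1,$$

assuming $[\Sigma] \ge 2$. Thus,

$$[\Sigma]^{s(l')+1} > 2[\Sigma]^{s(l')} - 1,$$

$$[\Sigma]^{s(l')+1} - 2[\Sigma]^{s(l')} > -1,$$

$$2[\Sigma]^{s(l')+1} - 2[\Sigma]^{s(l')} > [\Sigma]^{s(l')+1} - 1,$$

$$2[\Sigma]^{s(l')}([\Sigma] - 1) > [\Sigma]^{s(l')+1} - 1,$$

$$2[\Sigma]^{s(l')} > ([\Sigma]^{s(l')+1} - 1)/([\Sigma] - 1).$$

Finally, since

$$[V] = [\Sigma]^{s(v)} = [\Sigma]^{s(l')}$$

and

$$[L] = ([\Sigma]^{s(l')+1} - 1)/([\Sigma] - 1),$$

we have $2[V] > [L]$.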

□ 
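A quick numeric check of the sandwich in Theorem A.3 is sketched below, assuming an alphabet of at least two symbols and a maximum length of at least one (so that V is a proper subset of L). The function names are hypothetical.

```python
def vector_count(alphabet_size, length):
    """[V]: number of vectors of exactly the given length."""
    return alphabet_size ** length

def string_count(alphabet_size, max_len):
    """[L]: number of strings of length at most max_len (alphabet_size >= 2)."""
    return (alphabet_size ** (max_len + 1) - 1) // (alphabet_size - 1)

def sandwich_holds(alphabet_size, length):
    """Check [V] < [L] < 2[V] for one alphabet size and one length."""
    v = vector_count(alphabet_size, length)
    l = string_count(alphabet_size, length)
    return v < l < 2 * v

all_hold = all(
    sandwich_holds(k, s)
    for k in range(2, 8)      # alphabet sizes 2..7
    for s in range(1, 12)     # lengths 1..11
)
```

The binary alphabet is the tightest case: there [L] = 2[V] - 1, so the upper bound is missed by exactly one.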




Also, for transfinite $s(l')$, we have $[L] = [V]$; i.e. the addition, subtraction or
multiplication of a finite number with a transfinite cardinal yields the same transfinite
cardinal. Therefore, for finite or transfinite $s(l')$, [L] and [V] are very similar. Let F
be the set of all total functions of the form $f : X \to Y$, where X and Y are arbitrary
sets. A set of strings satisfies the prefix condition if no string in the set contains
another string from the set as its first letters.

Theorem A.4 The largest number of different strings of length $\le l$ and satisfying
the prefix condition is the number of vectors of length l.

Proof:

Any string of length $< l$ eliminates all longer strings with that beginning from the set
of strings satisfying the prefix condition. There is always more than one string
eliminated by a string of length $< l$. A string of length l only eliminates itself. Therefore,
the largest set of strings satisfying the prefix condition is the set of strings of length
exactly l.

□ 
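For a very small case, Theorem A.4 can be verified by exhaustive search over all subsets of the strings of length at most l. The following is a brute-force sketch, practical only for tiny alphabets and lengths; the helper names are assumptions made for this illustration.

```python
from itertools import combinations, product

def all_strings(alphabet, max_len):
    """Every string over the alphabet of length 0..max_len (includes the empty string)."""
    return [
        "".join(symbols)
        for length in range(max_len + 1)
        for symbols in product(alphabet, repeat=length)
    ]

def is_prefix_free(strings):
    """True if no string in the collection is a proper prefix of another."""
    return not any(a != b and b.startswith(a) for a in strings for b in strings)

def largest_prefix_free(alphabet, max_len):
    """Search subsets from largest to smallest and return the first prefix-free one."""
    universe = all_strings(alphabet, max_len)
    for size in range(len(universe), 0, -1):
        for subset in combinations(universe, size):
            if is_prefix_free(subset):
                return set(subset)
    return set()

best = largest_prefix_free("01", 2)
```

For the binary alphabet and l = 2, the search settles on the four strings of length exactly 2, in agreement with the theorem.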


The following theorem gives the cardinality of a set of functions.

Theorem A.5 If F is the set of all total functions of the form $f : X \to Y$ then

$$[F] = [Y]^{[X]}.$$

Proof:

Each function $f \in F$ can be thought of as a vector made up of the sequence of images
under f for all $x \in X$. Thus, the cardinality of F is the cardinality of a set of vectors
of length [X] over the symbol set Y.

□
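The proof's view of a function as a vector of images translates directly into an enumeration. In this sketch, functions are represented as Python dictionaries; the name `total_functions` and the example sets are illustrative assumptions.

```python
from itertools import product

def total_functions(domain, codomain):
    """List every total function domain -> codomain as a dict of {x: image of x}."""
    domain = list(domain)
    return [
        dict(zip(domain, images))
        for images in product(codomain, repeat=len(domain))
    ]

# Illustrative sets: a 3-element domain and a 2-element codomain.
functions = total_functions("abc", (0, 1))
count = len(functions)
```

Each tuple produced by `product` is one vector of images, so the number of dictionaries equals the codomain size raised to the domain size.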


We might have assumed that F is the set of all partial (rather than total) functions
on a given domain. However, we can model any partial function $g : X \to Y$ with a
total function $f : X \to (Y \cup \{y'\})$, where $y'$ is some symbol not in Y. That is, let
$f(x) = g(x)$ for any x in the base of g and let $f(x) = y'$ for all other x’s. Therefore,
the set of all partial functions into Y has a one-to-one correspondence with the set of
all total functions into $Y \cup \{y'\}$. The number of partial functions into Y equals the
number of total functions into $Y \cup \{y'\}$. The number of total functions into $Y \cup \{y'\}$
is $([Y] + 1)^{[X]}$. The number of partial functions into Y is then $([Y] + 1)^{[X]}$. Therefore,
whether we use F as the set of all total functions or as the set of all partial functions,
there is little difference in the essential combinatorics.
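The extra-symbol construction described above can be sketched as follows. The sentinel `UNDEFINED` plays the role of the symbol y'; it and the function name `partial_functions` are assumptions introduced for this illustration.

```python
from itertools import product

UNDEFINED = object()   # stands in for the extra symbol y', which is not in Y

def partial_functions(domain, codomain):
    """List every partial function domain -> codomain as a dict over its base."""
    domain = list(domain)
    extended = list(codomain) + [UNDEFINED]
    result = []
    for images in product(extended, repeat=len(domain)):
        # Drop the points mapped to y': they are outside the base of the partial function.
        result.append({x: y for x, y in zip(domain, images) if y is not UNDEFINED})
    return result

partials = partial_functions("abc", (0, 1))
count = len(partials)
distinct = len({frozenset(g.items()) for g in partials})
```

Each total function into the extended codomain yields a different partial function, so the list has no duplicates and its length is the codomain size plus one, raised to the domain size.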




A.2 Program Length Constraints for Computation

A.2.1 Introduction

In most texts on Computability Theory you will find a statement such as: there exist
number theoretic functions (i.e. functions on finite numbers) which cannot be
computed with any finite program. The proof of this statement depends only upon some
combinatorial arguments. While this is clearly a most impressive consequence of these
combinatorial considerations, the consequences are much broader than indicated by
that one statement. The objective of this appendix is to recognize and precisely state
the general computing limitations implied by combinatorial considerations. As
compared with the traditional statement of noncomputability, the following statements
are stronger and more general, yet equally easy to derive.

Now it is reasonable to ask whether or not there exists a machine for a given P
and F. We know from above that, regardless of the particular association desired
between P and F, these simple combinatorial constraints must be satisfied. If they
are not satisfied then we say that no machine exists which can be programmed from
P to compute any element of F. In particular, if we consider P = $\Sigma_P^*$ then [P] is $\aleph_0$.
If we also consider F to be the set of all functions on the set of all finite strings then,
although s(x) is finite, [X] (or [f]) is $\aleph_0$ and [F] is c. Therefore, [P] is not greater
than or equal to [F] and, consequently, there does not exist a machine which can be
programmed from P to compute any element of F. This is the traditional statement
of noncomputability.

We feel that the generalized developments of this appendix allow a more intuitive
introduction to computability as well as the more general implications of program
length constraints. That is, first develop the max-min program length constraints.
These constraints are intuitive and easy to develop from basic combinatorial
considerations. With an understanding of these constraints, we can easily derive the
traditional noncomputability result as well as many other more general or stronger
results. The existence of noncomputable functions is uncoupled from the traditional
sources of confusion such as the relationship between transfinite cardinals and the rich
(but irrelevant) structure of Turing machines and number theoretic functions. Therefore,
although this alternative introduction to noncomputability is more general and
stronger, it can also be more intuitive. We can then develop average-minimum program
length constraints by applying Information Theoretic concepts to this paradigm.
There are several practically important applications of these program length
constraints.




Figure A.1: A Machine’s Interfaces (panels: “Accepting a program” and “Realizing a function”)

A.2.2  Programmable  Machines 

Definition  of  a  Programmable  Machine 

The objective of this appendix is to generalize the development of the combinatorial
implications on computing. One area of extended generality is our model of a machine.
This section will define our machine model. Many computational systems are
well modeled (in terms of the combinatorics of interest) by this machine, including
traditional programmable systems, analog computing, logic circuit design and machine
learning. Therefore, the theoretical developments with respect to this model
have far ranging applications; yet this model is adequate for the many results based
on combinatorial considerations of the functions and programs involved. To say that
a very general M exists does not tell us much, but we will be specifying conditions
under which M does not exist. Therefore, the generality of the machine model makes
the non-existence results stronger.

A programmable machine (PM) consists of a programming language P, a set
of functions F, and a surjective mapping $M : P \to F$, i.e. PM = {P, F, M}. When
we say “machine” we mean a closed physical system (i.e. a black box) which may
include people. When we say “PM” we mean a model of the physical machine. A
program is a string from an alphabet $\Sigma_P$ which satisfies the prefix condition. The
prefix condition says that a program p cannot contain any other program as its first
i characters, i = 1, 2, ..., s(p). This condition is necessary and sufficient to make it
possible for a machine to be able to decide, after accepting each program character,
whether or not a complete program has been accepted. Therefore, P is a subset of $\Sigma_P^*$
and is formally a language. Of course P could be a traditional programming language
(e.g. Fortran); however, it can also be a circuit specification, data samples, etc. The
length of a program p is its length as a string, i.e. s(p).
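The prefix condition, and the character-by-character acceptance it makes possible, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the function names and the example language are hypothetical and do not appear in the report.

```python
def satisfies_prefix_condition(programs):
    """Return True if no program is a proper prefix of another program."""
    ordered = sorted(set(programs))
    # In sorted order, a string that is a prefix of any later string is also a
    # prefix of its immediate successor, so checking neighbours is sufficient.
    return all(not b.startswith(a) for a, b in zip(ordered, ordered[1:]))

def accept_program(stream, programs):
    """Consume characters one at a time and stop at the first complete program.

    With a prefix-free P the decision after each character is unambiguous.
    Returns the accepted program, or None if the stream ends first.
    """
    buffer = ""
    for ch in stream:
        buffer += ch
        if buffer in programs:
            return buffer
    return None

P = {"0", "10", "110", "111"}                   # hypothetical prefix-free language
print(satisfies_prefix_condition(P))            # True
print(satisfies_prefix_condition(P | {"11"}))   # False: "11" begins "110"
print(accept_program("1101", P))                # 110
```

Adding "11" to the language breaks the condition, because a machine that has read "11" could no longer tell whether the program is complete.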

A machine interfaces with the rest of the world through three avenues (Figure A.1).
A  machine  has  the  ability  to  accept  a  program  and  realize  a  function.  A  machine 
realizes  a  function  if  when  presented  with  any  element  from  the  function’s  domain, 
the  machine  automatically  (i.e.  without  any  help  from  outside  the  machine)  produces 
the  corresponding  element  from  the  function’s  range.  The  idea  that  functions  are 




Figure A.2: “Programs” in a Communications Context

an  especially  important  model  of  computational  processes  is  well  developed  in  the 
Computing  Theory  literature  (e.g.  [50,  59]),  so  we  do  not  redevelop  it  here.  However, 
we  point  out  that  nearly  all  computational  systems  realize  a  function.  A  machine  has 
accepted  a  program  if  it  then  realizes  the  function  associated  with  that  program. 

We can think of a “program” as a message sent to a machine instructing it to
realize a particular function (Figure A.2). The machine is capable of realizing any
one of the functions in F when given the appropriate message from P. A program is
everything necessary to specify a particular function with respect to a machine capable
of realizing a variety of functions. As a “program” crosses this communications
channel we have an excellent opportunity to characterize its size. This communications
perspective of a program proves very useful in the section on Average-Minimum
program length.

Programmable Machines are a very general model of computational systems. A
PM is an adequate model of many (perhaps all) computational systems because the
semantics of the programming language, which is how computational systems differ,
are unimportant with respect to the combinatorics. PM’s include traditional
programmable systems (as in a general purpose digital computer) and other computational
systems that are not obviously “programmable.” We demonstrate the idea
of a PM with five examples. For each example, we discuss P, F, M, the meaning of
“program length” within the context of the example, and how the prefix condition is
satisfied.

Examples  of  Programmable  Machines 

Random Access Memory (RAM) has a natural PM model. Consider a simplified
model of RAM consisting of n address lines, m data lines and a Read/Write line
(Figure A.3). A “program” p for the RAM might consist of a sequence of Address-Data
combinations with a “Write” value on the Read/Write line, e.g. $p = (d_1, d_2, \ldots, d_k)$
where $d_i = \{\text{Write}, \text{Address}_i, \text{Data}_i\}$. P for the RAM might consist of all possible
such p’s. A function f realized by a RAM maps Read-Write × Address into Data, i.e.
$\{\text{Read}\} \times \{0,1\}^n \to \{0,1\}^m$, and F might be the set of all possible such functions. M
defines f from p, that is, $f(x) = \text{Data}_i$ if there exists a $d \in p$ such that the Address




Figure A.3: Simplified RAM Model (labeled lines: Address, Read/Write, Data)

of d is x and the Data of d is $\text{Data}_i$. Program length (in bits) with respect to RAM
is the number of bits required to specify a program. For $p = (d_1, d_2, \ldots, d_k)$ and
$d_i = \{\text{Write}, \text{Address}_i, \text{Data}_i\}$, each $d_i$ consists of 1 bit for Read/Write plus n bits
of address plus m bits of data. There are k $d_i$’s; therefore, there are $k(1 + n + m)$
bits in a program. For total functions, $k = 2^n$, so there are $2^n(1 + n + m)$ bits of
program (i.e. $s(p) = 2^n(1 + n + m)$). Therefore, we can define P, F, and M (i.e. PM)
such that PM is a model of RAM. In this context the theoretical results which we
will develop concerning minimum program length have an interpretation as minimum
storage requirements.
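The RAM example can be made concrete with a short Python sketch. It is an illustration under stated assumptions: the function names, the tuple encoding of each $d_i$, and the example memory contents are hypothetical choices, not the report's.

```python
def tabular_ram_program(f, n):
    """Build p = (d_1, ..., d_k) with one (Write, Address, Data) triple per n-bit address."""
    return [("Write", addr, f(addr)) for addr in range(2 ** n)]

def program_length_bits(program, n, m):
    """s(p) = k(1 + n + m): 1 Read/Write bit, n address bits and m data bits per triple."""
    return len(program) * (1 + n + m)

def realize(program):
    """M(p): the function the RAM realizes on Read after accepting the program p."""
    memory = {addr: data for _, addr, data in program}
    return lambda addr: memory[addr]

n, m = 3, 4                                # hypothetical RAM: 8 words of 4 bits
f = lambda addr: (5 * addr) % (2 ** m)     # arbitrary example contents
p = tabular_ram_program(f, n)
g = realize(p)
print(program_length_bits(p, n, m))        # 2**3 * (1 + 3 + 4) = 64
print(all(g(a) == f(a) for a in range(2 ** n)))
```

Here `realize` plays the role of M: after the Write sequence has been accepted, every Read returns the stored word.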

For  our  second  example  of  the  generality  of  the  PM  model  we  consider  Turing 
machines.  Turing  machines  are  standard  models  in  Computing  Theory  and  are  known 
to  be  equivalent  (in  terms  of  computability)  to  many  other  models.  Define  P  to  be  the 
set  of  Turing  machines.  As  pointed  out  in  [59,  p.295],  a  Turing  machine  is  completely 
defined  by  its  transition  function.  Therefore,  in  the  context  of  this  example,  a  program 
is  a  transition  function.  F  is  the  set  of  computable  functions.  The  exact  domain  and 
codomain of $f \in F$ are determined by the input and tape symbol sets. M is the
function realized by a person or computer capable of 1) taking a transition function
and constructing (or simulating) the Turing machine (i.e. accept the program) and
2) running the Turing machine (i.e. realize the function). “Program length” is the
size  of  the  transition  function.  Traditional  computing  languages  (e.g.  Fortran)  can 
similarly  be  modeled. 

As  a  third  example  of  the  PM  model  consider  a  “learning  by  example”  machine. 
We can model a learning machine by defining a program p to be a sequence of examples.
P is the set of all example sequences of interest. F is the set of all functions
which can be learned by the machine. M models the learning system. M accepts a
program  by  “learning”  (e.g.  forming  rules,  forming  feature  vectors,  adjusting  weights, 
etc.).  M  realizes  a  function  in  its  post-learning  behavior.  The  “program  length”  is  the 
size  of  the  example  set.  In  this  context,  the  developments  about  minimum  program 




length and “programmability” become minimum example requirements and “learnability.”
This example could be adapted to other learning modes, where whatever it
is that the machine learns from acts as the program.

As a fourth example of a PM, we connect “program length” with circuit complexity.
Define a program to be the specification of a logic circuit. P is the set of all such
specifications. F is the set of all functions of the form $f : \{0,1\}^n \to \{0,1\}$. M is the
function realized by 1) a person or automated manufacturing system which “accepts
a program” by constructing a circuit per the specifications and 2) the constructed
circuit which realizes a function. There are many ways in which a logic circuit might
be specified (e.g. ((w AND x) OR (NOT y)) AND z). However, any specification
must identify the gates (e.g. AND, OR, NOT) and the interconnection of the gates.
Therefore, program length (i.e. length of the circuit specification) is related to the
complexity of the circuit (i.e. number of gates and number of interconnections). In
this context, the developments concerning minimum program length become limits
on minimum circuit complexity.

The  discussion  in  the  previous  example  did  not  depend  on  the  circuits  having 
discrete  values.  Therefore  we  could  define  a  PM  for  analog  computing  (electronic, 
acoustical, optical, etc.). Programs would reflect an identification of the components
and  their  interconnection.  F  would  be  the  set  of  functions  realizable  by  some  specified 
analog  computer. 

Tabular  Programs 

There is a special form of program, for most machines, with the following characteristics.
One, every function in F has a representation of this form and two, the
length of the program in this form is the same for every function in F. We call a
representation of this form a tabular representation. A tabular representation is
essentially an exhaustive list or table of the function’s values. Revisiting the examples
above, every RAM program is tabular. That is, every function has a program and
all programs are the same length. With respect to a Turing Machine, a transition
function which includes a transition for each ordered pair of f (i.e. x and f(x)) is
a tabular representation. In this case, as with the RAM, the size of the transition
function (program length) corresponds to the size of f (i.e. $s(p) \propto [f]$). A tabular
representation with respect to a learning machine simply means that the “example
set” is an exhaustive list of the ordered pairs of f. That is, there is an example in
the training set for each possible input. Again, the size of the example set (program
length) corresponds to the size of f. A tabular representation with respect to logic
circuits could be constructed using canonical forms. A tabular representation of a
function is a kind of backstop for all representations.

The  Set  of  Minimum  Length  Programs 

As our final comment on machine models we identify a special subset of the set of
all programs. A program specifies which element of F the machine is to realize.




Each program has exactly one function associated with it. However, each function
may have zero, one, or more associated programs. That is, a given program can only
realize one function, but a particular function may have any number of programs.
Let $P_{min}$, a subset of P, consist of those elements of P which do not have a shorter
program that realizes the same function. That is,

$$P_{min} = \{p \in P \mid s(p) \leq s(q) \;\; \forall q \in P \text{ such that } M(p) = M(q)\},$$

where M(p) indicates the function realized by p. A largest element of $P_{min}$ is denoted
p', so that $s(p') \geq s(p) \;\forall p \in P_{min}$. That is, p' is a largest program of $P_{min}$ required to
realize any realizable function of F. If s(p') is greater than some threshold (t) then
it immediately follows that there exists a program of length greater than t. That is,
p' is a program of length greater than t. s(p') will be called the max-min program
length of set P with respect to M.
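The definitions of $P_{min}$ and the max-min program length can be sketched for a toy machine. In this Python illustration the mapping M is given as a dictionary from program strings to function labels; the helper names and the example machine are assumptions made here.

```python
def minimum_length_programs(M):
    """P_min: programs for which no shorter program realizes the same function.

    M is a dict mapping each program (a string) to a label for the function it realizes.
    """
    shortest = {}
    for p, f in M.items():
        shortest[f] = min(shortest.get(f, len(p)), len(p))
    return {p for p, f in M.items() if len(p) == shortest[f]}

def max_min_program_length(M):
    """s(p'): the length of a longest program in P_min."""
    return max(len(p) for p in minimum_length_programs(M))

# Hypothetical machine: five prefix-free programs realizing three functions.
M = {"0": "f1", "10": "f2", "110": "f1", "1110": "f3", "1111": "f3"}
print(sorted(minimum_length_programs(M)))   # ['0', '10', '1110', '1111']
print(max_min_program_length(M))            # 4
```

The program "110" is excluded because the shorter program "0" realizes the same function, while both length-4 programs for f3 are kept since neither is shorter than the other.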

A.2.3 Maximum-Minimum Program Length for Finite and
Transfinite Sets

The  following  theorem  relates  function  set  and  program  set  cardinalities. 

Theorem A.6 If (M, P, F) is a Programmable Machine then $[P] \geq [F]$.

Proof: 

By definition of a function, each $p \in P$ can be associated with at most one $f \in F$;
or a particular program can only compute one function. Also, since M is onto F,
there is a (p, f) pair for every $f \in F$. Therefore, for every $f \in F$ there is at least one
unique $p \in P_{min}$ and $P_{min} \subseteq P$.

□ 


This seems trivial. However, it is the basis for all the combinatorics-based
noncomputability results. Now let p' denote the longest program in $P_{min}$. Now we relate
function set cardinality and program length.

Theorem A.7 If (M, P, F) is a Programmable Machine and the programs are strings
on an alphabet $\Sigma_P$ of $[\Sigma_P]$ letters then $\frac{[\Sigma_P]^{s(p')+1} - 1}{[\Sigma_P] - 1} \geq [F]$.

Proof: 

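Theorem A.2 demonstrated that the set of all strings of length less than or equal to
s(p') has $\frac{[\Sigma_P]^{s(p')+1} - 1}{[\Sigma_P] - 1}$ elements. Therefore, any particular set P, with a longest
string p', cannot have more than $\frac{[\Sigma_P]^{s(p')+1} - 1}{[\Sigma_P] - 1}$ elements, i.e.
$\frac{[\Sigma_P]^{s(p')+1} - 1}{[\Sigma_P] - 1} \geq [P]$. This inequality and that of Theorem A.6 gives us
$\frac{[\Sigma_P]^{s(p')+1} - 1}{[\Sigma_P] - 1} \geq [P] \geq [F]$.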




Theorem A.8 If (M, P, F) is a Programmable Machine and the programs are vectors
on an alphabet $\Sigma_P$ of $[\Sigma_P]$ letters then $[\Sigma_P]^{s(p')} \geq [F]$.

Proof: 

Same as Theorem A.6, except replace the expression for the cardinality of a set of
strings with the expression for the cardinality of a set of vectors (Theorem A.1).

□


Theorem A.8 may be interpreted as an approximation of the constraint of Theorem A.7.
This approximation is correct to within a factor of 2 by Theorem A.3.
The discussion will use this simpler and essentially correct expression. We will call
“$[\Sigma_P]^{s(p')} \geq [F]$” the first max-min program length constraint.
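The string bound of Theorem A.7 and the simpler vector bound of Theorem A.8 can be compared numerically. The Python sketch below is illustrative only; the function names are hypothetical, and an alphabet of at least two letters is assumed so that the loops terminate.

```python
def strings_up_to(alphabet_size, s):
    """Number of strings of length <= s, empty string included."""
    return sum(alphabet_size ** i for i in range(s + 1))

def min_length_for_strings(alphabet_size, num_functions):
    """Smallest s(p') allowed by the string bound (Theorem A.7)."""
    s = 0
    while strings_up_to(alphabet_size, s) < num_functions:
        s += 1
    return s

def min_length_for_vectors(alphabet_size, num_functions):
    """Smallest s(p') allowed by the vector bound (Theorem A.8)."""
    s = 0
    while alphabet_size ** s < num_functions:
        s += 1
    return s

# The closed form of the geometric sum matches the direct count.
print(strings_up_to(3, 4), (3 ** 5 - 1) // (3 - 1))

# Hypothetical function-set sizes with a binary alphabet.
for size in (200, 256):
    print(size, min_length_for_strings(2, size), min_length_for_vectors(2, size))
```

For these binary examples the two minimum lengths differ by at most one character, which is why the simpler vector expression is used in the discussion.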

What does the first max-min program length constraint imply about the relationship
between function cardinality [f] and program length s(p')? In general, nothing.
That is, the function set F and individual elements of F can have any combination of
cardinalities. For example, let the individual functions be of the form {(i, j)} with
$i, j \in N$. In this example, $[F] = \aleph_0$ while [f] = 1. On the other hand, suppose we
are only interested in two functions, namely f(x) = x and g(x) = x + 1, each on the
Reals. In this case, [F] = 2, while [f] = c. Therefore, in general, the combinatorial
constraint implies nothing about the relationship between function cardinality and
max-min program length.

Let us assume that the functions of interest are exactly all the total functions on a
given domain (X). This assumption creates a relationship between [F] and [f]. This
allows us to use the combinatorial constraint to relate [f] and s(p').


Theorem A.9 If (M, P, F) is a Programmable Machine, the programs are strings
on an alphabet $\Sigma_P$ of $[\Sigma_P]$ letters and F is the set of all total functions of the form
$f : X \to Y$ then

$$\frac{[\Sigma_P]^{s(p')+1} - 1}{[\Sigma_P] - 1} \geq [Y]^{[X]}.$$


Proof: 

By Theorem A.4 $[F] = [Y]^{[X]}$. Theorem A.9 then follows immediately from
Theorem A.7.

□ 


Theorem A.10 If (M, P, F) is a Programmable Machine, the programs are vectors
on an alphabet $\Sigma_P$ of $[\Sigma_P]$ letters, F is the set of all total functions of the form
$f : X \to Y$ and $[Y] = [\Sigma_P]$ then $s(p') \geq [X]$.


Proof: 

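The expression $\frac{[\Sigma_P]^{s(p')+1} - 1}{[\Sigma_P] - 1}$ of Theorem A.9 is replaced by $[\Sigma_P]^{s(p')}$ as in
Theorem A.8 to get $[\Sigma_P]^{s(p')} \geq [Y]^{[X]}$. Since $[Y] = [\Sigma_P]$, a log of both sides gives
$s(p') \geq [X]$.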

□ 




Note that [X] = [f] for total functions. When we are talking about decision
problems (i.e. [Y] = 2) and a binary alphabet (i.e. $[\Sigma_P] = 2$), this expression applies.
Again, by fixing some particulars and approximating the cardinality of a set of strings
we can simplify the expression and better reveal the essential relationships. We call
this the second max-min program length constraint.
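The second constraint can be checked numerically for small finite cases. The sketch below assumes vector programs and $[Y] = [\Sigma_P] = q$, as in Theorem A.10; the function name `max_min_bound` is a hypothetical helper introduced for illustration.

```python
def max_min_bound(x_card, q):
    """Smallest s with q**s >= [F], where [F] = q**x_card.

    This is the first constraint for vector programs when F is the set of all
    total functions X -> Y and [Y] = [Sigma_P] = q.
    """
    num_functions = q ** x_card      # [F] = [Y]**[X]
    s = 0
    while q ** s < num_functions:
        s += 1
    return s

# Decision problems on n binary inputs with binary programs: [X] = 2**n.
for n in (1, 2, 3, 4):
    print(n, 2 ** n, max_min_bound(2 ** n, 2))
```

In each case the smallest admissible length equals [X], which for total functions is also [f], the size of a look-up table.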

A comparison between function cardinality and program length seems to be especially
natural. The representation of a function as a look-up table is always possible,
although  infinite  functions  require  infinite  tables.  We  can  think  of  a  table  with  a 
look-up  capability  as  a  “program”  and  the  length  of  this  program  is  exactly  that  of 
the  function.  This  is  the  basis  for  many  of  the  developments  of  Chapter  4. 

We can think of the length of an input in two ways. The length of the input as the
length of a string might be one measure. For functions with inputs from N, the input
length might be the value of the input. However, the set N and the set of finite length
strings have the same cardinality, therefore it makes no difference which approach we
use. Let us treat input length as the length of the string that expresses the input,
and denote it s(x). Now what does the combinatorial constraint imply about the
relationship between input length and program length? Again, in general, nothing.
That is, the function set F and an individual element $f \in F$ can have independent
cardinalities (as indicated before) and the cardinality of an input x to f can be
independent of [f]. For the example with individual functions of the form {(i, j)},
with $i, j \in N$, $[F] = \aleph_0$ while s(x) = 1. On the other hand, for the two functions,
f(x) = x and g(x) = x + 1, each on the Reals, [F] = 2 while $s(x) = \aleph_0$. Therefore, in
general, the combinatorial constraint implies nothing about the relationship between
input length and max-min program length.

We established in Theorem A.9 that, with an assumption about the form of the
function set F, we could relate function cardinality [f] to function set cardinality [F].
Now we introduce an additional assumption which allows us to relate input length
s(x) and function set cardinality [F]. Let us assume, as before, that the functions of
interest are exactly all the total functions on a given domain X and codomain Y. In
addition let us assume that the domain X is the set of all strings on an alphabet $\Sigma_X$
of $[\Sigma_X]$ letters. The following theorem relates input length and program length.


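Theorem A.11 If (M, P, F) is a Programmable Machine, the programs are strings
on an alphabet $\Sigma_P$ of $[\Sigma_P]$ letters, F is the set of all total functions of the form
$f : X \to Y$ and X is the set of all strings on $\Sigma_X$ of length less than or equal to s(x')
then

$$\frac{[\Sigma_P]^{s(p')+1} - 1}{[\Sigma_P] - 1} \geq [Y]^{\left(\frac{[\Sigma_X]^{s(x')+1} - 1}{[\Sigma_X] - 1}\right)}$$

or equivalently

$$\frac{[\Sigma_P]^{s(p')+1} - 1}{[\Sigma_P] - 1} \geq [Y]^{\left(\frac{[\Sigma_X]^{s(x)+1} - 1}{[\Sigma_X] - 1}\right)} \quad \forall x \in X.$$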




Proof:

From Theorem A.2, $[X] = \frac{[\Sigma_X]^{s(x')+1} - 1}{[\Sigma_X] - 1}$. Theorem A.11 then follows immediately from
Theorem A.9.

□ 

Theorem A.12 If (M, P, F) is a Programmable Machine, the programs are strings
on an alphabet $\Sigma_P$ of $[\Sigma_P]$ letters, F is the set of all total functions of the form
$f : X \to Y$ and X is the set of all strings on $\Sigma_X$ of length less than or equal to s(x')
then $s(p') \geq [\Sigma_X]^{s(x')}$.

Proof: 

By Theorem A.1 $[X] = [\Sigma_X]^{s(x')}$. The proof then follows from Theorem A.10.

□ 


Again, the expression of the constraint can be greatly simplified by assumptions
and approximations in non-essential areas. Under these assumptions, if a machine
can realize all functions with input less than or equal to s(x') then s(p') is greater
than or equal to the power of input length, i.e. $s(p') \geq [\Sigma_X]^{s(x')}$. This might also
be expressed $s(p') \geq [\Sigma_X]^{s(x)} \;\forall x \in X$. This is the third max-min program length
constraint.

The remainder of this section is a summary of the developments so far. We
discussed the important role of simple combinatorics in deciding the max-min program
length. In particular, for there to exist a sufficient number of programs, at least one
for each function, the longest program must be at least a certain length. Another way
to express this is that there must exist a program of a certain length s(p') or longer.
We developed three statements of this combinatorial constraint: 1) $[\Sigma_P]^{s(p')} \geq [F]$,
2) $s(p') \geq [f]$, and 3) $s(p') \geq [\Sigma_X]^{s(x)} \;\forall x \in X$. We called these the first, second and
third max-min program length constraints, respectively. We did not assume finite
cardinalities in the development of these constraints; therefore, all the constraints are
valid for finite and transfinite cardinals.

The  first  statement  says  that  there  must  exist  a  program  whose  length  is  greater 
than  or  equal  to  the  number  of  elements  in  the  set  of  functions  that  the  machine 
can  be  programmed  to  realize.  This  is  the  most  general  and  most  explicit  of  the 
three  statements.  It  is  general  in  that  it  uses  neither  of  the  assumptions  required  for 
statements  2  and  3.  In  particular,  the  functions  and  their  domains  are  not  assumed 
to  be  of  a  particular  form.  The  first  statement  is  more  explicit  in  that  it  includes  [F] 
and s(p'), which are the driving parameters of the problem. The other parameters
(i.e. [f] or s(x)) determine s(p') only indirectly through [F], via the assumptions of
the  second  and  third  statements. 

The  second  statement  says  that  there  must  exist  a  program  with  length  greater 
than  or  equal  to  the  number  of  elements  in  the  functions  that  the  machine  can  be 
programmed  to  realize.  This  statement  requires  that  we  assume  a  function  set  of  a 
certain  form.  Therefore,  it  is  not  as  general  as  the  first  statement.  However,  this 
statement  has  an  attractive  intuitive  interpretation.  A  function  is  a  set  of  cardinality 




[f], therefore, it is reasonable to expect a representation of the function to be of
cardinality [f]. Furthermore, any function can be represented with a look-up table
(although transfinite functions will require transfinite table sizes) and the size of the
table is exactly [f].

The  third  statement  says  that  there  must  exist  a  program  with  length  greater  than 
or  equal  to  the  power  of  the  longest  input  string  to  the  functions  that  the  machine  can 
be  programmed  to  realize.  This  statement  requires  two  assumptions.  One  assumption 
concerns  the  functions’  form  relative  to  the  domain.  The  other  assumption  concerns 
the  form  of  the  domain  itself.  Therefore,  this  third  statement  is  the  least  general  of 
the three. These assumptions provide the necessary link back to [F]; however, [F] is
now  twice  removed  in  this  expression.  Therefore  this  third  statement  is  also  the  least 
explicit. 

Programmability 

We have seen how the combinatorial constraint implies that “programs” must be
a certain length if we are to realize a certain variety of functions. However, with
the machine model of Section A.2.1 we see that “program” has an unusually general
meaning. For example, the combinatorial constraint implies that the specification of
a circuit must be so long if the specification format is sufficiently general to allow
specifying a certain variety of functions. Similarly, the combinatorial constraint implies
that a certain minimum number of examples are required if a learning machine
is to be capable of learning a certain variety of functions. The constraint even limits
the expressibility of a natural language in instructing a person in a task.

We would now like to develop the idea of “programmability.” We assume that
we have a set of functions (F) which we wish to be able to compute and a set of
programs (P) which we are capable of generating. The programmability question then
is: “Does there exist a Programmable Machine which will map P into F?” Another,
more traditional, way of viewing this problem is: “Does there exist an $f \in F$ which
cannot be programmed from P?” The previous results provide a sufficient condition
for the existence of non-programmable functions.

Theorem A.13 Given a set of programs P and a set of functions F there exists a
non-programmable function if [P] < [F].

Proof: 

There are more f’s in F than there are p’s in P and any given p can realize only one f.

□ 
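The counting argument behind Theorem A.13 can be illustrated with a small finite case. In the Python sketch below, the machine M is an arbitrary, randomly chosen assignment of functions to programs; the names and the example sizes are assumptions made for illustration.

```python
from itertools import product
import random

def non_programmable(M, F):
    """Functions in F that no program in the mapping M realizes."""
    return set(F) - set(M.values())

# Hypothetical sets: 16 Boolean functions of two inputs (as truth tables)
# and only 8 programs (binary vectors of length 3), so [P] < [F].
F = list(product([0, 1], repeat=4))
P = ["".join(bits) for bits in product("01", repeat=3)]

rng = random.Random(0)                    # fixed seed for repeatability
M = {p: rng.choice(F) for p in P}         # an arbitrary machine M : P -> F
missing = non_programmable(M, F)
print(len(P), len(F), len(missing))
```

Whatever assignment is chosen, at least [F] - [P] = 8 functions are left without a program, since each program realizes only one function.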


We can produce several variants of this result using the conditions of
Theorems A.8, A.10, and A.12.




Theorem A.14 If (M, P, F) is a machine, P is the set of strings on $\Sigma_P$ of length
less than or equal to s(p'), F is the set of all total functions of the form $f : X \to Y$
and

$$s(p') < [X] \log([Y]) / \log([\Sigma_P])$$

then there exist non-programmable functions. Further, if X is the set of all vectors
of length s(x) on $\Sigma_X$ and

$$s(p') < [\Sigma_X]^{s(x)} \log([Y]) / \log([\Sigma_P])$$

then there exist non-programmable functions.

Proof: 

Substitute $[F] = [Y]^{[X]}$ (by Theorem A.4), $[P] = [\Sigma_P]^{s(p')}$ (by Theorem A.1) and
$[X] = [\Sigma_X]^{s(x)}$ (by Theorem A.1) into Theorem A.13.

□ 
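The threshold of Theorem A.14 is easy to evaluate for concrete sizes. The Python sketch below is illustrative: the function names and the example figures are assumptions, and the exact integer test uses the vector count $[\Sigma_P]^{s(p')}$ for [P], as in the proof.

```python
import math

def programmability_threshold(x_card, y_card, alphabet_size):
    """[X] log([Y]) / log([Sigma_P]): shorter s(p') implies non-programmable functions."""
    return x_card * math.log(y_card) / math.log(alphabet_size)

def has_non_programmable(s_max, x_card, y_card, alphabet_size):
    """Exact integer form of the same test: [Sigma_P]**s(p') < [Y]**[X]."""
    return alphabet_size ** s_max < y_card ** x_card

# Hypothetical case: decision functions on 8-bit inputs ([X] = 256, [Y] = 2),
# with programs written over a 256-letter alphabet (one byte per character).
t = programmability_threshold(256, 2, 256)
print(t)                                        # approximately 32
print(has_non_programmable(31, 256, 2, 256))    # True
print(has_non_programmable(32, 256, 2, 256))    # False
```

The integer comparison avoids floating-point rounding when s(p') lies exactly on the threshold.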


With a number of additional specializations to the already specialized result of
Theorem A.14, we can arrive at the traditional statement of noncomputability.
Computability can be defined by thresholds of allowed resources (s(p')) and required
performance ([F], [f], or s(x)). If these two thresholds are set such that the max-min
program length constraint is violated (i.e. $[\Sigma_P]^{s(p')} < [F]$) then there exist
noncomputable functions. Otherwise, the max-min program length constraint does not imply
noncomputability.

The practice in traditional Computability Theory is to set thresholds which require
s(p') and s(x) to be finite but they can be any element of an unbounded set.
That is, traditional Computability Theory is concerned with the set of functions with
domain N, where N can be thought of as a set of strings. The assumptions used in
developing the traditional Computability Theory statement of noncomputability are
those satisfied in the third max-min program length constraint. Noncomputability
then boils down to the fact that the requirement $s(p') \geq [\Sigma_X]^{s(x)}$ for all finite s(x)
implies that $s(p') \geq n$ for all finite n, i.e. that s(p') is not finite. Since we allowed all
finite s(x), and required s(p') to be finite, the max-min program length constraint is
not satisfied. Therefore, we say there exist “noncomputable” functions. Since s(x) is
unbounded and f is defined for all x, we know that [f] is unbounded, in particular,
$[f] = \aleph_0$ and $[F] = c$. Therefore, the second max-min program length constraint
becomes $s(p') \geq \aleph_0$ and the first statement becomes $[\Sigma_P]^{s(p')} \geq c$. As expected, s(p')
is not finite.

Theorem A.15 If P is the set of strings on $\Sigma_P$ of length less than or equal to some
finite s(p'), F is the set of all total functions of the form $f : X \to Y$, and X is the
set of all strings of finite length then there does not exist a Programmable Machine
(M, P, F).




Proof: 

From Theorem A.11, $[\Sigma_P]^{s(p')} \ge [Y]^{[\Sigma_X]^{s(x)}}$ $\forall x \in X$ if $(M, P, F)$ is a
Programmable Machine. However, in this theorem $s(x)$ is unbounded, thus the inequality
of Theorem A.11 implies that $s(p')$ must be such that the left side is greater than
or equal to the right side for all finite $s(x)$; that is, $s(p')$ is infinite. Therefore, a
Programmable Machine is inconsistent with the conditions of this theorem.

□ 


This last theorem is equivalent to the traditional Computability Theory statement,
“There exist number theoretic functions which cannot be computed with a
finite program.” It is interesting to express this in the context of the first and second
constraints on max-min program length. The traditional statement as a comparison
of function set cardinality to program length (as in the first constraint) is: “In the
infinite set of functions with domain N there exists a function which cannot be
computed with a finite program.” As a comparison of function cardinality and program
length, we might express the traditional Computability Theory statement: “there
exist infinite functions that cannot be computed with a finite program.” So expressed,
noncomputability is much more intuitive. The theorems of this section are a
generalization of this familiar Computability result. These statements apply for any function
type (even continuous functions), machine type, or programming method. Therefore,
regardless of the sophistication of the machine, the efficiency of the program, the
degree of structure in the functions to be realized, or the length of the function's input,
$s(p')$ is bounded by the above constraints.


Summary 

We proposed a generalized development of the combinatorial limitations on
computability. A machine is a function $M : P \to F$ from a set of programs onto a set
of functions. By definition of a surjective function, $[P] \ge [F]$. If we assume $P$ is a
set of strings on an alphabet of $K_P$ letters with no elements longer than $s(p')$ then
essentially $K_P^{s(p')} \ge [P]$. Therefore, $K_P^{s(p')} \ge [F]$. If we also assume that $F$ is the set
of all functions on $X$ into $Y$ then $[F] = [Y]^{[X]}$. Therefore, $K_P^{s(p')} \ge [Y]^{[X]}$.
For a binary programming alphabet and decision problems this becomes $s(p') \ge [f]$.
Finally, if we also assume that $X$ is the set of all strings no longer than $s(x')$ then
$[X] \ge [\Sigma_X]^{s(x')}$. Therefore, $K_P^{s(p')} \ge [Y]^{[\Sigma_X]^{s(x')}}$. Again, for a binary programming
alphabet and decision problems this becomes $s(p') \ge [\Sigma_X]^{s(x')}$, or $s(p') \ge [\Sigma_X]^{s(x)}$ for all
$x \in X$. Therefore, in order for a machine $M : P \to F$ to compute any function in $F$,
there must be a program $p' \in P$ whose length is such that $K_P^{s(p')} \ge [F]$, which under
additional assumptions becomes $s(p') \ge [f]$ or $s(p') \ge [\Sigma_X]^{s(x)}$ for all $x \in X$.




A.2.4  Average-Minimum Program Length Bound for Finite Sets

We now develop a lower bound for the average length (i.e. the average over $F$) of
programs in $P_{\min}$ when $[F]$ is finite. First we introduce a few new terms. As before,
$P$ is a set of programs satisfying the prefix condition on $\Sigma_P$. $F$ is a finite set of
functions. $P_{\min}$ is the subset of $P$ containing exactly one shortest program for each $f$
in $F$. $M$ associates a unique $f$ in $F$ with each $p$ in $P$. $\Pi$ is a probability measure on
$F$, i.e. $\Pi : F \to [0, 1]$ and the sum of $\Pi(f)$ over all $f \in F$ is 1. Using this probability
measure, we define the average-minimum program length $S_F$ in the natural way:

$$S_F = \sum_{i=1}^{[F]} \Pi(f_i)\, s(p_i), \quad \text{where } f_i = M(p_i),\ i = 1, 2, \ldots, [F], \text{ and } p_i \in P_{\min}.$$

Also based on this probability measure, we can define the entropy $H$ of $F$ with respect
to $\Pi$:

$$H = \sum_{i=1}^{[F]} \Pi(f_i) \log \frac{1}{\Pi(f_i)}.$$

Entropy  has  units  associated  with  it  which  depend  on  the  base  of  the  logarithm.  For 
a  base  of  two  the  units  are  called  bits. 

Notice the exact analogy between the situation defined above and that of Source
Encoding in Information Theory [23]. Our set of functions $f_1, f_2, \ldots, f_{[F]}$ corresponds
to the source alphabet $a_1, a_2, \ldots, a_K$ of Information Theory. Our program alphabet
corresponds to their code alphabet ($D = [\Sigma_P]$). Our minimum program lengths
$s(p_1), s(p_2), \ldots, s(p_{[F]})$ correspond to code lengths $n_1, n_2, \ldots, n_K$. Our
average-minimum program length $S_F$ corresponds to their average code length ($\bar{n}$).
Our prefix condition for programs is a special case of the Information Theory
requirement for unique decodability. The lower bound on average-minimum program
length proven below is the analog of the “Source Encoding Theorem” of Information
Theory. This theorem was first proven in [56] as “The Fundamental Theorem for a
Noiseless Channel.” The proofs given below are from [23].

Two  lemmas  will  be  needed  for  the  proof  of  the  average-minimum  program  length 
lower  bound. 

Lemma A.1 $\ln(z) \le z - 1$ for all $z > 0$, with equality if and only if $z = 1$.

Proof:

Consider the first and second derivatives of $\ln(z) - z + 1$.

□ 


The following lemma is known as the Kraft inequality.

Lemma A.2 If $P$ is a set of programs for $F$ then

$$\sum_{i=1}^{[F]} [\Sigma_P]^{-s(p_i)} \le 1.$$




Proof:

Define $s' = \max(s(p_1), s(p_2), \ldots, s(p_{[F]}))$, i.e. $s'$ is the length of the longest
program in $P_{\min}$. Let $A$ be the set of all possible programs of length $s'$, so that
$[A] = [\Sigma_P]^{s'}$. Each program $p_i$ in $P_{\min}$ corresponds to a subset of $A$ (i.e. $A_i$) as
follows: $A_i = \{a \in A \mid$ the string made up of the first $s(p_i)$ elements of $a$ is the same
as $p_i\}$. That is, $A_i$ is the set of all strings of length $s'$ which begin with the substring
$p_i$. All of the $A_i$'s are disjoint because of the prefix condition. That is, if $a$ is in both
$A_i$ and $A_j$ then $p_i$ or $p_j$ is a prefix of the other, which violates the prefix condition.
Since the $A_i$'s are disjoint subsets of $A$,

$$\sum_{i=1}^{[F]} [A_i] \le [A].$$

Note that $[A] = [\Sigma_P]^{s'}$ and $[A_i] = [\Sigma_P]^{s' - s(p_i)}$. Therefore,

$$\sum_{i=1}^{[F]} [\Sigma_P]^{s' - s(p_i)} \le [\Sigma_P]^{s'}$$

$$\sum_{i=1}^{[F]} [\Sigma_P]^{-s(p_i)} \le 1$$
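
As an illustrative check of the Kraft inequality, the following Python sketch evaluates the
sum for a small prefix-free set of binary programs. The program strings and the helper
names `is_prefix_free` and `kraft_sum` are assumptions made for this example; they do
not appear in the report.

```python
from fractions import Fraction
from itertools import combinations


def is_prefix_free(programs):
    """Return True if no program is a prefix of (or equal to) another."""
    return not any(a.startswith(b) or b.startswith(a)
                   for a, b in combinations(programs, 2))


def kraft_sum(programs, alphabet_size):
    """Exact value of the sum of alphabet_size ** (-s(p)) over all programs p."""
    return sum(Fraction(1, alphabet_size ** len(p)) for p in programs)


# Hypothetical prefix-free set of binary programs (not taken from the report).
programs = ["0", "10", "110", "111"]

assert is_prefix_free(programs)
print(kraft_sum(programs, 2))  # the Kraft sum must not exceed 1
```

Dropping a program from the set can only lower the sum, while adding a string that has
an existing program as a prefix breaks the prefix condition the lemma relies on.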


The principal result of this section, the average-minimum program length lower
bound, can now be stated and proven.

Theorem A.16 If $(M, P, F)$ is a machine with probability measure $\Pi$ on $F$ then
$S_F \ge H / \log[\Sigma_P]$. That is, the average-minimum program length is greater than or
equal to the entropy of $F$ with respect to $\Pi$ divided by the log of the program alphabet
size. The base of the logarithm in the denominator of the right hand side of the
inequality is determined by the units of $H$ (i.e. the log used in computing $H$ and that
used in the inequality must have the same base).

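Proof:

We show that $H - S_F \log[\Sigma_P] \le 0$.

$$H - S_F \log[\Sigma_P] = \sum_{i=1}^{[F]} \Pi(f_i) \log \frac{1}{\Pi(f_i)} - \sum_{i=1}^{[F]} \Pi(f_i)\, s(p_i) \log[\Sigma_P]$$

$$= \sum_{i=1}^{[F]} \Pi(f_i) \log \frac{1}{\Pi(f_i)} + \sum_{i=1}^{[F]} \Pi(f_i) \log [\Sigma_P]^{-s(p_i)}$$

$$= \sum_{i=1}^{[F]} \Pi(f_i) \left[ \log \frac{1}{\Pi(f_i)} + \log [\Sigma_P]^{-s(p_i)} \right]$$

Let $z = [\Sigma_P]^{-s(p_i)} / \Pi(f_i)$; then

$$H - S_F \log[\Sigma_P] = \sum_{i=1}^{[F]} \Pi(f_i) \log(z) = \log e \sum_{i=1}^{[F]} \Pi(f_i) \ln z.$$

From Lemma A.1, $\ln z \le z - 1$, thus

$$H - S_F \log[\Sigma_P] \le \log e \sum_{i=1}^{[F]} \Pi(f_i)(z - 1) = \log e \sum_{i=1}^{[F]} \Pi(f_i) \left( \frac{[\Sigma_P]^{-s(p_i)}}{\Pi(f_i)} - 1 \right)$$

$$= \log e \left( \sum_{i=1}^{[F]} [\Sigma_P]^{-s(p_i)} - \sum_{i=1}^{[F]} \Pi(f_i) \right)$$

$$= \log e \left( \sum_{i=1}^{[F]} [\Sigma_P]^{-s(p_i)} - 1 \right).$$

From Lemma A.2,

$$\sum_{i=1}^{[F]} [\Sigma_P]^{-s(p_i)} \le 1;$$

therefore, $H - S_F \log[\Sigma_P] \le 0$.

□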
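
The bound of Theorem A.16 can be checked numerically on a small case. The sketch below
is only an illustration: the probability measure, the program lengths, and the function
names are assumptions chosen for the example, with all logarithms taken to base 2.

```python
import math


def entropy_bits(probs):
    """Entropy H, in bits, of a probability measure given as a list."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)


def average_min_length(probs, lengths):
    """Average-minimum program length S_F = sum of Pi(f_i) * s(p_i)."""
    return sum(p * s for p, s in zip(probs, lengths))


def entropy_lower_bound(probs, alphabet_size):
    """Right-hand side of Theorem A.16: H / log[Sigma_P], logs in base 2."""
    return entropy_bits(probs) / math.log2(alphabet_size)


# Hypothetical measure on four functions, with the lengths of the
# prefix-free binary programs 0, 10, 110, 111 (assumed for illustration).
probs = [0.5, 0.25, 0.125, 0.125]
lengths = [1, 2, 3, 3]

s_f = average_min_length(probs, lengths)
bound = entropy_lower_bound(probs, alphabet_size=2)
print(s_f, bound)
```

In this example the lengths match the probabilities exactly, so the average length meets
the bound; any other assignment of prefix-free lengths to the same measure gives a larger
average.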


For finite $[F]$, we can quickly establish a generalized form (with respect to $\Pi$) of
the Max-Min Program Length Lower Bound.

Corollary A.1 If $(M, P, F)$ is a machine with probability measure $\Pi$ on $F$ then

$$s(p') \ge H / \log[\Sigma_P].$$

Proof:

$$S_F = \sum_{i=1}^{[F]} \Pi(f_i)\, s(p_i) \le \sum_{i=1}^{[F]} \Pi(f_i)\, s(p') = s(p') \sum_{i=1}^{[F]} \Pi(f_i) = s(p')$$

and $S_F \ge H / \log[\Sigma_P]$ by Theorem A.16. Therefore, $s(p') \ge S_F \ge H / \log[\Sigma_P]$.

□


Suppose that all of the functions in $F$ are equally probable; i.e. $\Pi(f) = 1/[F]$ for
all $f \in F$.

Corollary A.2 If $(M, P, F)$ is a machine with a probability measure $\Pi$ on $F$ such
that $\Pi(f_i) = \Pi(f_j)$ for all $i, j$ then $S_F \ge \log[F] / \log[\Sigma_P]$.

Proof:

The entropy ($H$) of $F$ is:

$$H = -\sum_{f \in F} \Pi(f) \log \Pi(f) = -\sum_{f \in F} \frac{1}{[F]} \log \frac{1}{[F]} = \log[F].$$

By Theorem A.16, we have that $S_F \ge H / \log[\Sigma_P] = \log[F] / \log[\Sigma_P]$.

□


Therefore, the average-minimum program length is greater than or equal to $\log[F]$
(with the logarithm taken to base $[\Sigma_P]$). Not only does there exist a program of length
$\log[F]$ or longer (Theorem A.7), the average length is $\log[F]$ or longer. The following
corollary gives the lower bound with respect to $f$ and $s(x)$.


Corollary A.3 If $(M, P, F)$ is a machine with equally likely probability measure $\Pi$
on $F$ and $F$ is the set of all total functions of the form $f : X \to Y$ then

$$S_F \ge [X] \frac{\log[Y]}{\log[\Sigma_P]}.$$

Further, if $X$ is the set of all vectors of length $s(x)$ on $\Sigma_X$ then

$$S_F \ge \frac{[\Sigma_X]^{s(x)} \log[Y]}{\log[\Sigma_P]}.$$

Proof:

Substitute $[F] = [Y]^{[X]}$ (by Theorem A.4) into Corollary A.2 to get the first result.
Substitute $[X] = [\Sigma_X]^{s(x)}$ (by Theorem A.1) into the first part to get the second part.

□


The  following  corollary  demonstrates  that  if  there  is  a  short  program  then  there 
must  also  be  a  long  program. 

Corollary A.4 If $(M, P, F)$ is a Programmable Machine where there exists an $f \in F$
such that $f = M(p_1)$, $p_1 \in P_{\min}$, and $s(p_1) < S_F$, then there exists a $g$ in $F$ such that
$g = M(p_2)$, $p_2 \in P_{\min}$, and $s(p_2) > S_F$.

Proof:

Suppose to the contrary, that is, $s(p_1) < S_F$ but there does not exist $g$ such that
$g = M(p_2)$, $p_2 \in P_{\min}$, and $s(p_2) > S_F$. In other words, for all $p \in P_{\min}$, $s(p) \le S_F$.
However

$$S_F = \sum_{i=1}^{[F]} \Pi(f_i)\, s(p_i) = \Pi(f_1)\, s(p_1) + \sum_{i=2}^{[F]} \Pi(f_i)\, s(p_i)$$

$$< \Pi(f_1) S_F + \sum_{i=2}^{[F]} \Pi(f_i) S_F$$

since $s(p_1) < S_F$ and, for all $p \in P_{\min}$, $s(p) \le S_F$. However,

$$\Pi(f_1) S_F + \sum_{i=2}^{[F]} \Pi(f_i) S_F = S_F \sum_{i=1}^{[F]} \Pi(f_i) = S_F.$$

Therefore, $S_F < S_F$, which is a contradiction.

□


10 REM A BASICA Program to Demonstrate a Tabular Data Structure
20 DIM F(9)
30 DATA 3,2,7,5,4,7,8,3,6,4
40 FOR I=0 TO 9
50 READ F(I)
60 NEXT I
70 INPUT X
80 PRINT F(X)
90 GOTO 70
100 END

Figure A.4: BASIC Allows for Tabular Data Structures


The average-minimum program length lower bound of Section A.2.2 combined
with the exponential relationship between input size and function size will be used
to show that realistically programmable functions are an extremely small fraction of
possible functions. At any given time there is some upper bound ($B_P$) on the length
of realistic programs. For example, limited by today's technology, there are
no programs of lengths greater than say 10^” bits. Let $B_X$ be an upper limit on
allowed function input sizes; allowing larger inputs would only make the fraction of
realistically programmable functions even smaller. Most computer languages allow
for the representation of a function in a basically tabular structure (e.g. Figure A.4).

We  limit  our  discussion  in  this  section  to  machines  which  allow  tabular  programs. 
Any  traditional  computer  language  is  included.  Some  machine  learning  is  included, 
where  a  “tabular  program”  is  an  exhaustive  sample  set.  Some  circuit  design  situations 
could  be  included  here  also.  Therefore,  while  the  set  of  machines  which  allow  tabular 
programs  does  not  include  all  machines,  it  is  still  a  very  general  computer  model. 

We want to characterize the size of tabular programs in terms of a “table” and
some “overhead.” In Figure A.4 the numbers in the DATA statement are the “table”
and the rest of the program is the “overhead.” We can design a program whose
overhead is essentially constant with respect to increasing function size. The example
in Figure A.4 requires changing the DIM and FOR statements (these grow as the
log of function size); however, it is possible to design around such statements. Define
$M'$ to be the set of machines which allow programs with lengths $\le [f] + K$, for any
$f$ and some constant $K$. Let $F$ be a set of functions on inputs of length $B_X$. Let $G$
(the “good” functions) be the subset of $F$ such that

$$G = \{ f \in F \mid \exists p \ni f = M'(p) \text{ and } s(p) \le B_P \}.$$

Define $R = [G]/[F]$, the fraction of $F$ with programs of length $B_P$ or less. This
fraction of realistically computable functions is very small.

Theorem A.17 For any machine in $M'$,

$$R \le \frac{K}{[\Sigma_X]^{B_X} - B_P + K}.$$

Proof:

Let $S_\Phi$ be the average-minimum program length of the set of functions $\Phi$. We are
especially concerned with $S_F$, $S_G$ and $S_{F-G}$.

$$S_F = \frac{[G] S_G + ([F] - [G]) S_{F-G}}{[F]} = R S_G + (1 - R) S_{F-G}$$

Since every function in $F$ has a tabular program of cost $[f] + K$, the average-minimum
cost over $F$ (or any subset of $F$) cannot exceed this. Therefore,

$$S_{F-G} \le [f] + K = [\Sigma_X]^{B_X} + K$$

and

$$S_G \le B_P$$

from the definition of $B_P$. Substituting these relations into the expression for $S_F$:

$$S_F = R S_G + (1 - R) S_{F-G} \le R B_P + (1 - R)([\Sigma_X]^{B_X} + K).$$

From Theorem A.16, $S_F \ge [f] = [\Sigma_X]^{B_X}$. Therefore,

$$[\Sigma_X]^{B_X} \le S_F \le R B_P + (1 - R)([\Sigma_X]^{B_X} + K),$$

$$[\Sigma_X]^{B_X} \le R B_P + [\Sigma_X]^{B_X} + K - R([\Sigma_X]^{B_X} + K),$$

$$K \ge R([\Sigma_X]^{B_X} - B_P + K).$$

Therefore,

$$R \le \frac{K}{[\Sigma_X]^{B_X} - B_P + K}.$$

□
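
To get a feel for the size of this bound, the sketch below evaluates the right-hand side of
Theorem A.17 with exact rational arithmetic. It assumes a binary input alphabet
($[\Sigma_X] = 2$); the function name and the numeric values of $B_X$, $B_P$ and $K$ are
hypothetical choices for illustration, not figures from the report.

```python
from fractions import Fraction


def good_fraction_bound(input_bits, program_bound, overhead):
    """Upper bound K / (2**B_X - B_P + K) on R, assuming binary inputs.

    input_bits is B_X, program_bound is B_P, overhead is the constant K.
    """
    denominator = 2 ** input_bits - program_bound + overhead
    if denominator <= 0:
        raise ValueError("the bound is only informative when 2**B_X + K > B_P")
    return Fraction(overhead, denominator)


# Hypothetical values: 64 input bits, programs of up to 10**12 bits,
# and a constant overhead of 1000 bits.
bound = good_fraction_bound(input_bits=64, program_bound=10 ** 12, overhead=1000)
print(float(bound))
```

Because the table size doubles with every added input bit, raising the program bound by
several orders of magnitude changes the result very little.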


We see that the fraction of ‘good’ (i.e. realistically programmable) functions goes
as the inverse of the difference between the bound in program size ($B_P$) and the bound
in function size ($[\Sigma_X]^{B_X}$). However, the bound in function size goes up exponentially
with the bound in input size. Therefore, the fraction of ‘good’ functions is extremely
small for any reasonable combination of bounds. For example, even if we say that
programs up to 10^“ bits are realistic, then less than 10“^^ percent of the functions
on 64 bits (e.g. two 32 bit integers) have realistic programs.


  $n$      $\log R_1$              $\log R_2$
  12       -256                    > 0
  25       -3.3546 × 10^7          -8.554 × 10^6
  50       -1.1259 × 10^15         -1.1259 × 10^15
  100      -1.2676 × 10^30         -1.2676 × 10^30

Table A.1: Fraction of Functions Computable by NN

There are many questionable particulars in the assumptions of Theorem A.17.
For example, the ‘overhead’ should perhaps be log or linear rather than constant.
However, the point remains valid. That is, the fraction of functions which we can
reasonably expect to program is extremely small. Functions with reasonable programs
are very special. This result suggests that we should not discount representation
methods because they only have efficient representations on a small class of
functions. For example, one reason often given in Switching Theory for not pursuing
Function Decomposition as a design method is that only a small fraction of functions
decompose. However, it was just demonstrated that any representation method will
only efficiently represent a small fraction of functions.

The limitations of Neural Nets (NN) are quite striking when we interpret the
learned parameters of the net as its program.* A fully interconnected NN of depth $d$
and $n$ input bits has approximately $d \times n$ parameters which can be adjusted as the
net learns. Each parameter is some real variable of say $r$ bits. Therefore, the program
length is $r \times d \times n$. The number of programs possible on such a NN is $2^{r \times d \times n}$. The
number of functions on $n$ variables is $2^{2^n}$. The fraction of computable functions then
is $R = 2^{r \times d \times n} / 2^{2^n}$, which gets very small very fast. $R_1$ in Table A.1 is for $r = 64$ and
$d = 5$, which is a reasonable NN. $R_2$ is for $r = 1000$ and $d = 1000$, which is much
larger than most people are considering for NN's. Note that the table contains the
logarithm of $R$. Therefore, only an infinitesimal fraction of functions are computable
with any foreseeable NN.
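
The logarithm of this fraction is easy to evaluate exactly, since
$\log_2 R = r \times d \times n - 2^n$. The following sketch does so for the two parameter
settings named above. The function name and the particular values of $n$ (12, 25, 50,
100) are assumptions made here; they were chosen because they appear to reproduce the
magnitudes listed in Table A.1.

```python
def log2_fraction(n, r, d):
    """Base-2 logarithm of R = 2**(r*d*n) / 2**(2**n).

    A positive value means there are more programs than functions,
    so the counting argument places no limit for that n.
    """
    return r * d * n - 2 ** n


# Assumed input sizes; (r, d) = (64, 5) and (1000, 1000) follow the text.
for n in (12, 25, 50, 100):
    print(n, log2_fraction(n, r=64, d=5), log2_fraction(n, r=1000, d=1000))
```

Python integers have arbitrary precision, so the values for $n = 100$ are computed exactly
rather than rounded.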

We now demonstrate a machine which is optimal in terms of the average required
program length. Assume that the finite set $X$ is ordered. Define a Table Machine as
a machine whose program $p_i$ simply lists, in order of $X$, the images of $f_i : X \to Y$.
Assume that $F$ is the set of all total functions from $X$ into $Y$, where $Y$ is also a finite
set. Assume that the elements of $X$, when considered as strings, satisfy the prefix
condition. Finally assume that all elements of $F$ are equally probable.

*See [2] for a more sophisticated treatment of this idea.




10 REM A Basica Program for a Z-248 as a demonstration of a Table Machine.
20 DIM F(9)
30 REM Accept the Program
40 FOR I=0 TO 9
50 PRINT "F(";I;")=?";
60 A$=INKEY$: IF A$="" THEN 60
70 F(I)=VAL(A$)
80 PRINT F(I)
90 NEXT I
100 PRINT "Press S to stop"
110 REM Realize the Function
120 PRINT "X=";
130 A$=INKEY$: IF A$="" THEN 130
140 IF A$="S" THEN 190
150 X=VAL(A$)
160 PRINT X
170 PRINT "F(X)=";F(X)
180 GOTO 120
190 END


Figure  A. 5:  An  Example  Table  Machine 

An example of a Table Machine for $X = Y = \{0, 1, 2, 3, 4, 5, 6, 7, 8, 9\}$ is given in
Figure A.5 (actually the Table Machine consists of a PC running this program). This
example demonstrates that Table Machines exist in the “real world.”
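
The same idea can be sketched in a few lines of Python. This is an illustrative analog of
Figure A.5, not a translation of it: the name `table_machine` is an assumption, and the
sample program reuses the ten values from the DATA statement of Figure A.4.

```python
def table_machine(program):
    """Return the function realized by a table program.

    The program is the list of images f(x), given in the order of X,
    where X is assumed to be {0, 1, ..., len(program) - 1}.
    """
    table = tuple(program)

    def f(x):
        return table[x]

    return f


# The "program" is just the table: ten digits, one image per element of X.
program = [3, 2, 7, 5, 4, 7, 8, 3, 6, 4]
f = table_machine(program)
print([f(x) for x in range(10)])
```

Every one of the $10^{10}$ possible ten-digit programs realizes a different function from $X$
into $Y$, so the machine has exactly as many programs as there are functions to compute.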

The  “program”  for  this  example  is  the  first  10  digits  that  one  enters  in  response 
to  the  F(I)=?  prompts.  We  now  prove  that  a  Table  Machine  is  optimal  in  terms  of 
average  required  program  length. 

Theorem A.18 If $X$ is an ordered finite set whose elements satisfy the prefix
condition, $Y$ is a finite set, $F$ is the set of all total functions of the form $f : X \to Y$, the
elements of $F$ are equally probable, $P_T$ is the set of programs associated with a Table
Machine, and $P_M$ is the set of programs associated with any other machine, then
$S_{F-T} \le S_{F-M}$. That is, the average-minimum program length for a Table Machine
is optimal.

Proof:

$s(p_T) = [X]$ for $p_T \in P_T$ from the definition of a Table Machine. $S_{F-T} = s(p_T)$ since
all elements of $F$ are equally probable. Therefore, $S_{F-T} = [X]$. For an arbitrary
machine $M$, $S_{F-M} \ge \log[F] / \log[\Sigma_P]$ from Corollary A.3. However, $[F] = [Y]^{[X]}$ since
$F$ is the set of all total functions on $X$ into $Y$, and $[Y] = [\Sigma_P]$ from the definition of a
Table Machine. Then,

$$\log[F] = [X] \log[\Sigma_P] \quad \text{and} \quad S_{F-M} \ge \frac{[X] \log[\Sigma_P]}{\log[\Sigma_P]}.$$

Therefore, $S_{F-M} \ge [X] = S_{F-T}$.

□


One consequence of this theorem is that “powerful” languages (e.g. Fortran is more
“powerful” than a machine language) do not lessen the average-minimum program
length. The intuition that more powerful languages should allow a simpler expression
of a relationship is shown to depend on a non-equally probable distribution of the
function set. This suggests a connection between language design and some assumed
(at least implicitly) probability distribution on $F$.

Theorem  A. 19  establishes  that  the  bounds  of  Theorem  A. 16,  Corollary  A. 2  and 
Corollary  A.3  are  the  greatest  lower  bounds  when  the  assumptions  of  Theorem  A. 18 
are  met. 

Theorem A.19 If the conditions of Theorem A.18 are met, then the lower bounds of
Theorem A.16, Corollary A.2 and Corollary A.3 are greatest lower bounds of
average-minimum program length.

Proof: 

If  there  were  a  greater  lower  bound  then  it  would  be  violated  by  a  Table  Machine. 

□ 


A.3  Summary

This appendix develops a theory of computational complexity based on program
length. We formally defined our concept of programmable machine and proved
numerous properties of program length. A programmable machine is sufficiently abstract
to include many different kinds of problems. Under various interpretations the
program length results become memory requirements, learnability, or circuit complexity
results. In effect, we have brought into an engineering setting some of the
developments of Shannon, Turing, Chaitin and others. This common setting then has
extended applications, such as learnability, circuit complexity and especially Pattern
Theory.




Appendix  B 

Function  Decomposition  Program 
User’s  Guide 




This software is located in ficsim::user2:[vogt].¹ There are 10 (that's right, 10)
different versions, characterized as follows:


V.1     (8): Non-shared, exhaustive
V.2     (2): Non-shared, negative decompositions
V.2A    (0): Non-shared, negative decompositions, greedy search
V.2B    (3): Non-shared, negative decompositions, number of cares cost
V.2AB   (1): Non-shared, negative decompositions, number of cares cost, greedy
V.3     (9): Shared, exhaustive
V.4     (6): Shared, negative decompositions
V.4A    (4): Shared, negative decompositions, greedy search
V.4B    (7): Shared, negative decompositions, number of cares cost
V.4AB   (5): Shared, negative decompositions, number of cares cost, greedy


The  numbers  in  parentheses  indicate  relative  speed,  with  V.2A  being  the  fastest 
and  V.3  the  slowest.  These  are  only  rough  estimates,  and  may  even  be  slightly  wrong. 

¹This User's Guide was written by Chris Vogt.

To run the program, get into the directory appropriate to the version you want
to run, e.g.

$ set def [vogt.pbml.v4ab]

and then type

run pbml.driver

It's that simple. The program will then come up with the following prompt:

SPECIFY FUNCTION TABLE VALUES:
Enter 0 to input from a file, 1 from terminal,
or 2 to QUIT program:

Interactive  Interface 

To enter the function in question interactively, enter a 1. The program will then
prompt you for the information needed. The prompts are relatively straightforward.

A sample session to input a “checkerboard” on 3 variables (i.e. 01011010) is shown
below:

Name of function: checkerboards
How many input variables does the function have? 3
Enter 0 to input falses, 1 to enter trues: 1
How many function values will you be entering? 4
Enter negative values below for Don't Cares
Enter decimal equivalent of binary input that has a true value: 1
Enter decimal equivalent of binary input that has a true value: 3
Enter decimal equivalent of binary input that has a true value: 4
Enter decimal equivalent of binary input that has a true value: 6

To enter “don't care” values, use a negative number. For example, if we wanted to
enter the function 010X1010, we would have typed -3 instead of 3.

Input From a File

If you are going to be using a function often, it may be handy to store it in a file.

To input a function from a file, type 0 to the first prompt. The program will then
prompt you for the file name. Make sure you give the full file specification needed, or
the program will crash. The function is stored in the file with the following format:

The first $n + 1$ digits indicate the number of input variables in unary. The next $2^n$
digits are the function itself. Thus, the file for the checkerboard entered above would
be:

1
1
1
0
0
1
0
1
1
0
1
0

Note  that  the  first  4  digits  (1110)  tell  us  that  n  =  3,  and  the  next  8  are  the  actual 
function.  To  indicate  a  “don’t  care”  in  the  function,  use  a  2  instead  of  a  0  or  1. 
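
The file format described above can be sketched as a pair of small Python helpers. These
are illustrative only and are not part of the decomposition program: the function names
are hypothetical, and the one-digit-per-line layout is assumed from the sample listing.

```python
def encode_function(values):
    """Return the file's lines for a truth table over '0', '1', '2'.

    values is a string of length 2**n; '2' marks a don't care.
    The header is n ones followed by a zero (n in unary).
    """
    n = len(values).bit_length() - 1
    if not values or len(values) != 2 ** n:
        raise ValueError("table length must be a power of two")
    if set(values) - set("012"):
        raise ValueError("entries must be 0, 1, or 2 (don't care)")
    return ["1"] * n + ["0"] + list(values)


def decode_function(lines):
    """Inverse of encode_function: return (n, table string)."""
    digits = [line.strip() for line in lines if line.strip()]
    n = digits.index("0")  # the leading ones count the input variables
    values = digits[n + 1:]
    if len(values) != 2 ** n:
        raise ValueError("expected 2**n function values after the header")
    return n, "".join(values)


# The checkerboard on 3 variables from the text.
lines = encode_function("01011010")
print("\n".join(lines))
```

Decoding the lines produced for the checkerboard returns the pair (3, "01011010"), and
the don't-care version 010X1010 would be written with a 2 in place of the X.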


Non-interactive  Runs/Batch  Jobs 

Sometimes you may want to make runs of several functions and save the output.
This can be done simply using a command file. A sample command file,
BATCH_DRIVER.COM, is found in each version's [.TEST] subdirectory. It consists of two
lines:

$ set def [vogt.pbml.v1.test]
$ run/detached/input=input.dat/output=output.dat [-]pbml.driver

To utilize this file, you must first create a file input.dat to be used as input to the
program. It should contain exactly what you would type if you ran the program. A
sample input.dat is shown here:

0
[vogt.pbml.data]CHECK6.DAT
0
[vogt.pbml.data]LETTERA.DAT
0
[vogt.pbml.data]MIKES_EXAM.DAT
1
checkerboards
3
1
4
1
3
4
6
2

Note, this input file will first decompose three functions which are specified in files
(check6.dat, lettera.dat, mikes_exam.dat) and then will decompose the checkerboard
on 3 variables as in the example shown previously. The last 2 in the file is the
command to quit the program.
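
For many runs it may be convenient to generate input.dat rather than type it. The Python
sketch below builds the same sequence of responses as the sample file. It is a hypothetical
helper, not part of the distributed software, and it assumes the prompts appear in exactly
the order shown in the interactive session above.

```python
def batch_input(files, interactive=()):
    """Build the text of an input.dat file.

    files: file specifications to decompose (each answered with menu choice 0).
    interactive: (name, n_vars, trues) triples entered from the terminal
    (menu choice 1); negative entries in trues mark don't cares.
    """
    lines = []
    for spec in files:
        lines += ["0", spec]
    for name, n_vars, trues in interactive:
        # "1" after n_vars selects entry of trues rather than falses.
        lines += ["1", name, str(n_vars), "1", str(len(trues))]
        lines += [str(t) for t in trues]
    lines.append("2")  # quit the program
    return "\n".join(lines) + "\n"


text = batch_input(
    ["[vogt.pbml.data]CHECK6.DAT", "[vogt.pbml.data]LETTERA.DAT"],
    [("checkerboards", 3, [1, 3, 4, 6])],
)
print(text, end="")
```

Writing the returned string to input.dat in the version's [.TEST] subdirectory would then
take the place of editing the file by hand.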

After having created input.dat, one runs the program by issuing the command:

$ submit/nolog/noprint/notify batch_driver.com

The output from the program will be saved in the file output.dat, which will not be
accessible until the program is done running.


Recompilation 

If changes are made to the source code, recompilation is simple. In each version's
directory, there is a com file called comp*.com which does the job. The * is
replaced with the version number. Thus, in [vogt.pbml.v2ab] there is a file called
compv2ab.com. This file can be executed interactively (with the @ sign), or can be
submitted as a batch job.

Two  WARNINGS  about  recompiling: 

•  At  present  there  is  just  barely  enough  disk  quota  to  handle  all  of  the  files 
created  during  compilation.  You  need  about  1300  blocks  free,  so  check  the 
quota  beforehand. 

•  After compiling, to get rid of all of the “unnecessary” files created, execute the
command file killada.com found in [vogt.com].




Bibliography 


[1]  Thomas Abraham. Pattern Recognition: Machine vs. Man. Final Report,
USAF-UES Summer Faculty Research Program, July 1990.

[2]  Yaser S. Abu-Mostafa. Information theory, complexity and neural networks.
IEEE Communications Magazine, 27(11):25-28, 1989.

[3]  A. V. Aho, J. E. Hopcroft, and J. D. Ullman. Data Structures and Algorithms.
Addison-Wesley, Reading, Massachusetts, 1983.

[4]  Robert L. Ashenhurst. The decomposition of switching functions. In Proceedings
of the International Symposium on the Theory of Switching, April 1957. Also
in The Annals of the Harvard Computational Laboratory XXXIX, Harvard University
Press, Cambridge, Massachusetts, 1959, pages 74-116, and in Curtis'62
pages 571-602.

[5]  Mark Boeke. Pattern Based Machine Learning. Final Report, AFOSR High
School Apprenticeship Program, August 1990.

[6]  John D. Bransford and Barry S. Stein. The IDEAL Problem Solver. W. H.
Freeman and Company, New York, 1984.

[7]  Gilles Brassard and Paul Bratley. Algorithmics: Theory and Practice. Prentice
Hall, Englewood Cliffs, New Jersey, 1988.

[8]  Mike Breen. Some Results in Pattern-Based Machine Learning. Final Report,
USAF Summer Faculty Research Program, 1990.

[9]  Richard Burden and J. Faires. Numerical Analysis. Prindle, Weber and Schmidt,
Boston, third edition, 1985.

[10]  Maureen Caudill. Neural networks primer. AI Expert, 3(6):53-59, June 1988.

[11]  Michael Chabinyc. Pattern Based Machine Learning: A Comparison of Ada
Function Decomposer Versions. Final Report, AFOSR High School Apprenticeship
Program, August 1990.

[12]  Gregory J. Chaitin. Algorithmic Information Theory. Cambridge University
Press, New York, 1987.




[13]  Paul R. Cohen and Edward A. Feigenbaum, editors. The Handbook of Artificial
Intelligence. Addison-Wesley, Reading, Massachusetts, 1982.

[14]  H. Allen Curtis. A New Approach to The Design of Switching Circuits. D. Van
Nostrand Company, Princeton, New Jersey, 1962.

[15]  Lawrence Davis. Genetic Algorithms and Simulated Annealing. Morgan Kaufmann,
Palo Alto, California, 1987.

[16]  Pierre A. Devijver and Josef Kittler. Pattern Recognition: A Statistical Approach.
Prentice Hall, Englewood Cliffs, New Jersey, 1982.

[17]  D. Henry Edel. Introduction to Creative Design. Prentice Hall, Englewood Cliffs,
New Jersey, 1967.

[18]  Michael J. Findler. Neural Networks and Machine Learning. Graduate Student
Research Program Final Report, Air Force Office of Scientific Research, August
1989.

[19]  M. J. Fischer and N. Pippenger. M. J. Fischer Lecture Notes on Network
Complexity, Universität Frankfurt, Frankfurt, 1974.

[20]  A. A. Fraenkel. Abstract Set Theory. North-Holland Publishing Company,
Amsterdam, 1953.

[21]  Arthur D. Friedman. Fundamentals of Logic Design and Switching Theory.
Computer Science Press, Rockville, Maryland, 1986.

[22]  King Sun Fu. Syntactic Pattern Recognition and Applications. Prentice Hall,
Englewood Cliffs, New Jersey, 1982.

[23]  R. Gallager. Information Theory and Reliable Communication. Wiley, New
York, 1968.

[24]  Wendell Garner. Good patterns have few alternatives. In I. Janis, editor,
Current Trends in Psychology, pages 185-192, William Kaufmann Inc., Los Altos,
California, 1977.

[25]  Thomas K. Gearhart. Investigations of a Lower Bound on the Error in Learned
Functions. Final Report, USAF-UES Summer Faculty Research Program, July
1990.

[26]  Donald D. Givone. Introduction to Switching Circuit Theory. McGraw-Hill, New
York, 1970.

[27]  Gilbert Held and Thomas Marshall. Data Compression. Wiley, New York,
second edition, 1987.

[28]  Vladimir Hubka. Theory of Technical Systems. Springer-Verlag, New York, 1988.




[29]  Laveen  N.  Kanal.  Patterns  in  pattern  recognition:  1968-1974.  IEEE  Transac¬ 
tions  on  Information  Theory,  20:697-722,  November  1974. 

[30] John Langenderfer. A Study of the Computational Complexities of Functions
and their Inverses. Memorandum, Wright Research and Development Center,
WL/AART, Wright-Patterson AFB, OH, September 1989.

[31] Pat Langley, editor. Proceedings of the Fourth International Workshop on
Machine Learning, Morgan Kaufmann, Palo Alto, California, 1987.

[32] Pat Langley, Herbert A. Simon, Gary L. Bradshaw, and Jan M. Zytkow.
Scientific Discovery: Computational Explorations of the Creative Process, The MIT
Press, Cambridge, Massachusetts, 1987.

[33] Ming Li and Paul M. B. Vitanyi. Kolmogorov complexity and its applications. In
Jan Van Leeuwen, editor, Handbook of Theoretical Computer Science, chapter 4,
pages 189-254, The MIT Press, Cambridge, Massachusetts, 1990.

[34] O. B. Lupanov. A method of circuit synthesis. Izv. V.U.Z. Radiofiz.,
1(1):120-140, 1958.

[35]  M.  Machtey  and  P.  R.  Young.  Theory  of  Algorithms,  Elsevier  North-Holland, 
New  York,  1978. 

[36] Peter S. Maybeck. Stochastic Models, Estimation and Control, Volume 1,
Academic Press, New York, 1979.

[37] Peter S. Maybeck. Stochastic Models, Estimation and Control, Volume 2,
Academic Press, New York, 1982.

[38]  Loren  P.  Meissner.  The  Science  of  Computing.  Wadsworth  Publishing,  Belmont, 
California,  1974. 

[39] James L. Melsa and David L. Cohn. Decision and Estimation Theory.
McGraw-Hill, New York, 1978.

[40]  J.  L.  Meriam.  Statics.  Wiley,  New  York,  second  edition,  1971. 

[41] Paul L. Meyer. Introductory Probability and Statistical Applications.
Addison-Wesley, Reading, Massachusetts, second edition, 1970.

[42]  Ryszard  Michalski,  Jaime  Carbonell,  and  Tom  M.  Mitchell.  Machine  Learning: 
An  Artificial  Intelligence  Approach.  Volume  1,  Morgan  Kaufmann,  Palo  Alto, 
California,  1983. 

[43] G. A. Miller. The magical number seven, plus or minus two: some limits on our
capacity for processing information. The Psychological Review, 63:81-97, March
1956.


[44]  Gerald  J.  Montgomery  and  Keith  C.  Drake.  Abductive  networks.  In  SPIE 
Applications  of  Neural  Networks  Conference,  April  1990. 

[45]  Saburo  Muroga.  Logic  Design  and  Switching  Theory.  Wiley,  New  York,  1979. 

[46] A. Oppenheim and R. Schafer. Digital Signal Processing. Prentice Hall,
Englewood Cliffs, New Jersey, 1975.

[47] Engineering Concepts Curriculum Project. Man and His Technology.
McGraw-Hill, New York, 1973.

[48] Timothy D. Ross. Elementary Theorems in Pattern Theory. PhD thesis, Air
Force Institute of Technology, 1988.

[49] Timothy D. Ross. Pattern Representation and Recognition. Research Prospectus,
Air Force Institute of Technology, 1986.

[50] Timothy D. Ross and Alan V. Lair. Definition and realization in pattern
recognition system design. In Proceedings of the 1987 IEEE Int. Conf. on Systems,
Man and Cybernetics, pages 744-748, 1987.

[51] Timothy D. Ross and Alan V. Lair. On the role of patterns in recognizer design.
In Josef Kittler, editor, Pattern Recognition, pages 193-202, Springer-Verlag,
New York, 1988.

[52] J. Paul Roth. Minimization over Boolean trees. IBM Journal, 543-558,
November 1960.

[53] David E. Rumelhart, James L. McClelland, and the PDP Research Group.
Parallel Distributed Processing: Explorations in the Microstructure of Cognition.
Volume 1, The MIT Press, Cambridge, Massachusetts, 1986.

[54]  John  E.  Savage.  The  Complexity  of  Computing.  Wiley,  New  York,  1976. 

[55]  C.  P.  Schnorr.  The  network  complexity  and  the  Turing  machine  complexity  of 
finite  functions.  Acta  Informat.,  7:95-107,  1976. 

[56]  Claude  E.  Shannon.  A  mathematical  theory  of  communication.  Bell  Systems 
Tech.  Journal,  27:379-423  (Part  I),  623-656  (Part  II),  1948.  Reprinted  in  book 
form  with  postscript  by  W.  Weaver,  Univ.  of  Illinois  Press,  Urbana,  1949. 

[57]  Harold  A.  Simon.  A  Student’s  Introduction  to  Engineering  Design.  Pergamon 
Press,  New  York,  1975. 

[58] Jean-Claude Simon. Patterns and Operators: The Foundations of Data
Representation. McGraw-Hill, New York, 1986. Translated by J. Howlett.

[59] Thomas A. Sudkamp. Languages and Machines: An Introduction to the Theory
of Computation. Addison-Wesley, Reading, Massachusetts, 1988.


[60] H. C. Torng. Switching Circuits: Theory and Logic Design. Addison-Wesley,
Reading, Massachusetts, 1972.

[61] J. F. Traub, G. W. Wasilkowski, and H. Wozniakowski. Information-Based
Complexity. Academic Press, New York, 1988.

[62] Leonard Uhr, editor. Pattern Recognition: Theory, Experiment, Computer
Simulation, and Dynamic Models of Form Perception and Discovery. Wiley, New
York, 1966.

[63]  Satosi  Watanabe.  Pattern  Recognition:  Human  and  Mechanical.  Wiley,  New 
York,  1985. 

[64]  Ingo  Wegener.  The  Complexity  of  Boolean  Functions.  Wiley,  New  York,  1987. 

[65]  William  Allen  Whitworth.  Choice  and  Chance.  Hafner  Publication  Company, 
New  York,  1951. 

[66] H. Woolf, editor. Webster’s New Collegiate Dictionary. G. and C. Merriam Co.,
Springfield, Mass., 1973.




